Chapter 9 Flashcards
Basic Hardware
Main memory and the registers built into each processing core are the only
general-purpose storage that the CPU can access directly. There are machine
instructions that take memory addresses as arguments, but none that take disk
addresses. Therefore, any instructions in execution, and any data being used
by the instructions, must be in one of these direct-access storage devices. If the
data are not in memory, they must be moved there before the CPU can operate
on them.
Registers that are built into each CPU core are generally accessible within
one cycle of the CPU clock. Some CPU cores can decode instructions and perform
simple operations on register contents at the rate of one or more operations
per clock tick. The same cannot be said of main memory, which is accessed
via a transaction on the memory bus. Completing a memory access may take
many cycles of the CPU clock. In such cases, the processor normally needs to
stall, since it does not have the data required to complete the instruction that it
is executing. This situation is intolerable because of the frequency of memory
accesses. The remedy is to add fast memory between the CPU and main memory,
typically on the CPU chip for fast access. Such a cache was described in
Section 1.5.5. To manage a cache built into the CPU, the hardware automatically
speeds up memory access without any operating-system control. (Recall from
Section 5.5.2 that during a memory stall, a multithreaded core can switch from
the stalled hardware thread to another hardware thread.)
Not only are we concerned with the relative speed of accessing physical
memory, but we also must ensure correct operation. For proper system
operation, we must protect the operating system from access by user processes,
as well as protect user processes from one another. This protection must
be provided by the hardware, because the operating system doesn’t usually
intervene between the CPU and its memory accesses (because of the resulting
performance penalty). Hardware implements this protection in several different
ways, as we show throughout the chapter. Here, we outline one possible
implementation.
We first need to make sure that each process has a separate memory space.
Separate per-process memory space protects the processes from each other and
is fundamental to having multiple processes loaded in memory for concurrent
execution. To separate memory spaces, we need the ability to determine the
range of legal addresses that the process may access and to ensure that the
process can access only these legal addresses. We can provide this protection
by using two registers, usually a base and a limit, as illustrated in Figure 9.1.
The base register holds the smallest legal physical memory address; the limit
register specifies the size of the range. For example, if the base register holds
300040 and the limit register is 120900, then the program can legally access all
addresses from 300040 through 420939 (inclusive).
Protection of memory space is accomplished by having the CPU hardware
compare every address generated in user mode with the registers. Any attempt
by a program executing in user mode to access operating-system memory or
other users’ memory results in a trap to the operating system, which treats the
attempt as a fatal error (Figure 9.2). This scheme prevents a user program from
(accidentally or deliberately) modifying the code or data structures of either
the operating system or other users.
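In C-like terms, the hardware's test amounts to the comparison sketched below. This is only an illustration; the register variables and the trap_to_os() routine are invented names, and real hardware performs the check as part of every memory access rather than in software.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical registers, loadable only by the operating system. */
static uint32_t base;   /* smallest legal physical address   */
static uint32_t limit;  /* size of the legal range, in bytes */

extern void trap_to_os(uint32_t addr);  /* assumed addressing-error trap */

/* A user-mode access to addr is legal only if base <= addr < base + limit. */
bool legal_access(uint32_t addr) {
    if (addr >= base && addr - base < limit)
        return true;        /* access proceeds to memory          */
    trap_to_os(addr);       /* treated as a fatal error by the OS */
    return false;
}

With base = 300040 and limit = 120900, exactly the addresses 300040 through 420939 pass this test, matching the earlier example.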
The base and limit registers can be loaded only by the operating system,
which uses a special privileged instruction. Since privileged instructions can
be executed only in kernel mode, and since only the operating system executes
in kernel mode, only the operating system can load the base and limit registers.
This scheme allows the operating system to change the value of the registers
but prevents user programs from changing the registers’ contents.
The operating system, executing in kernel mode, is given unrestricted
access to both operating-system memory and users’ memory. This provision
allows the operating system to load users’ programs into users’ memory, to
dump out those programs in case of errors, to access and modify parameters
of system calls, to perform I/O to and from user memory, and to provide
many other services. Consider, for example, that an operating system for a
multiprocessing system must execute context switches, storing the state of one
process from the registers into main memory before loading the next process’s
context from main memory into the registers.
Address Binding
Usually, a program resides on a disk as a binary executable file. To run, the
program must be brought into memory and placed within the context of a
process (as described in Section 2.5), where it becomes eligible for execution
on an available CPU. As the process executes, it accesses instructions and data
from memory. Eventually, the process terminates, and its memory is reclaimed
for use by other processes.
Most systems allow a user process to reside in any part of the physical
memory. Thus, although the address space of the computer may start at 00000,
the first address of the user process need not be 00000. You will see later how
the operating system actually places a process in physical memory.
In most cases, a user program goes through several steps—some of which
may be optional—before being executed (Figure 9.3). Addresses may be represented
in different ways during these steps. Addresses in the source program
are generally symbolic (such as the variable count). A compiler typically binds
these symbolic addresses to relocatable addresses (such as “14 bytes from the
beginning of this module”). The linker or loader (see Section 2.5) in turn binds
the relocatable addresses to absolute addresses (such as 74014). Each binding
is a mapping from one address space to another.
Classically, the binding of instructions and data to memory addresses can
be done at any step along the way:
* Compile time. If you know at compile time where the process will reside in
memory, then absolute code can be generated. For example, if you know
that a user process will reside starting at location R, then the generated
compiler code will start at that location and extend up from there. If, at
some later time, the starting location changes, then it will be necessary to
recompile this code.
* Load time. If it is not known at compile time where the process will reside
in memory, then the compiler must generate relocatable code. In this case,
final binding is delayed until load time. If the starting address changes, we
need only reload the user code to incorporate this changed value.
* Execution time. If the process can be moved during its execution from one
memory segment to another, then binding must be delayed until run time.
Special hardware must be available for this scheme to work, as will be
discussed in Section 9.1.3. Most operating systems use this method.
A major portion of this chapter is devoted to showing how these various bindings
can be implemented effectively in a computer system and to discussing
appropriate hardware support.
Logical Versus Physical Address Space
An address generated by the CPU is commonly referred to as a logical address,
whereas an address seen by the memory unit—that is, the one loaded into the memory-address register of the memory—is commonly referred to as a
physical address.
Binding addresses at either compile or load time generates identical logical
and physical addresses. However, the execution-time address-binding scheme
results in differing logical and physical addresses. In this case, we usually refer
to the logical address as a virtual address. We use logical address and virtual
address interchangeably in this text. The set of all logical addresses generated
by a program is a logical address space. The set of all physical addresses
corresponding to these logical addresses is a physical address space. Thus, in
the execution-time address-binding scheme, the logical and physical address
spaces differ.
The run-time mapping from virtual to physical addresses is done by a
hardware device called the memory-management unit (MMU) (Figure 9.4).
We can choose from many different methods to accomplish such mapping, as
we discuss in Section 9.2 through Section 9.3. For the time being, we illustrate
this mapping with a simple MMU scheme that is a generalization of the base-register
scheme described in Section 9.1.1. The base register is now called
a relocation register. The value in the relocation register is added to every
address generated by a user process at the time the address is sent to memory
(see Figure 9.5). For example, if the base is at 14000, then an attempt by the
user to address location 0 is dynamically relocated to location 14000; an access
to location 346 is mapped to location 14346.
The user program never accesses the real physical addresses. The program
can create a pointer to location 346, store it in memory, manipulate it, and compare
it with other addresses—all as the number 346. Only when it is used as a
memory address (in an indirect load or store, perhaps) is it relocated relative to
the base register. The user program deals with logical addresses. The memory-mapping
hardware converts logical addresses into physical addresses. This
form of execution-time binding was discussed in Section 9.1.2. The final location
of a referenced memory address is not determined until the reference is
made.
We now have two different types of addresses: logical addresses (in the
range 0 to max) and physical addresses (in the range R + 0 to R + max for a base
value R). The user program generates only logical addresses and thinks that
the process runs in memory locations from 0 to max. However, these logical
addresses must be mapped to physical addresses before they are used. The concept of a logical address space that is bound to a separate physical address
space is central to proper memory management.
Dynamic Loading
In our discussion so far, it has been necessary for the entire program and all
data of a process to be in physical memory for the process to execute. The size
of a process has thus been limited to the size of physical memory. To obtain
better memory-space utilization, we can use dynamic loading. With dynamic
loading, a routine is not loaded until it is called. All routines are kept on disk
in a relocatable load format. The main program is loaded into memory and
is executed. When a routine needs to call another routine, the calling routine
first checks to see whether the other routine has been loaded. If it has not, the
relocatable linking loader is called to load the desired routine into memory and
to update the program’s address tables to reflect this change. Then control is
passed to the newly loaded routine.
The advantage of dynamic loading is that a routine is loaded only when it
is needed. This method is particularly useful when large amounts of code are
needed to handle infrequently occurring cases, such as error routines. In such
a situation, although the total program size may be large, the portion that is
used (and hence loaded) may be much smaller.
Dynamic loading does not require special support from the operating
system. It is the responsibility of the users to design their programs to take
advantage of such a method. Operating systems may help the programmer,
however, by providing library routines to implement dynamic loading.
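On POSIX systems, one such library facility is the dlopen()/dlsym() interface. The sketch below loads a hypothetical error-reporting routine only the first time it is needed; the shared-object name liberrs.so and the routine report_error() are invented for the example (on many systems, link with -ldl).

#include <dlfcn.h>   /* POSIX dynamic-loading interface */
#include <stdio.h>

/* Call report_error() from a shared object, loading it only on first use. */
void handle_rare_error(int code) {
    static void (*report_error)(int) = NULL;

    if (report_error == NULL) {                       /* routine not yet loaded */
        void *handle = dlopen("liberrs.so", RTLD_LAZY);
        if (handle == NULL) {
            fprintf(stderr, "load failed: %s\n", dlerror());
            return;
        }
        report_error = (void (*)(int)) dlsym(handle, "report_error");
        if (report_error == NULL) {
            fprintf(stderr, "lookup failed: %s\n", dlerror());
            return;
        }
    }
    report_error(code);                               /* control passes to the routine */
}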
Dynamic Linking and Shared Libraries
Dynamically linked libraries (DLLs) are system libraries that are linked to user
programs when the programs are run (refer back to Figure 9.3). Some operating
systems support only static linking, in which system libraries are treated like any other object module and are combined by the loader into the binary
program image. Dynamic linking, in contrast, is similar to dynamic loading.
Here, though, linking, rather than loading, is postponed until execution time.
This feature is usually used with system libraries, such as the standard C
language library. Without this facility, each program on a system must include a
copy of its language library (or at least the routines referenced by the program)
in the executable image. This requirement not only increases the size of an
executable image but also may waste main memory. A second advantage of
DLLs is that these libraries can be shared among multiple processes, so that
only one instance of the DLL need be in main memory. For this reason, DLLs are also
known as shared libraries, and are used extensively in Windows and Linux
systems.
When a program references a routine that is in a dynamic library, the loader
locates the DLL, loading it into memory if necessary. It then adjusts addresses
that reference functions in the dynamic library to the location in memory where
the DLL is stored.
Dynamically linked libraries can be extended to library updates (such as
bug fixes). In addition, a library may be replaced by a new version, and all
programs that reference the library will automatically use the new version.
Without dynamic linking, all such programs would need to be relinked to gain
access to the new library. So that programs will not accidentally execute new,
incompatible versions of libraries, version information is included in both the
program and the library. More than one version of a library may be loaded
into memory, and each program uses its version information to decide which
copy of the library to use. Versions with minor changes retain the same version
number, whereas versions with major changes increment the number. Thus,
only programs that are compiled with the new library version are affected by
any incompatible changes incorporated in it. Other programs linked before the
new library was installed will continue using the older library.
Unlike dynamic loading, dynamic linking and shared libraries generally
require help from the operating system. If the processes in memory are protected
from one another, then the operating system is the only entity that can
check to see whether the needed routine is in another process’s memory space
or that can allow multiple processes to access the same memory addresses.
We elaborate on this concept, as well as how DLLs can be shared by multiple
processes, when we discuss paging in Section 9.3.4.
Contiguous Memory Allocation
The main memory must accommodate both the operating system and the
various user processes. We therefore need to allocate main memory in the
most efficient way possible. This section explains one early method, contiguous
memory allocation.
The memory is usually divided into two partitions: one for the operating
system and one for the user processes. We can place the operating system
in either low memory addresses or high memory addresses. This decision
depends on many factors, such as the location of the interrupt vector. However,
many operating systems (including Linux and Windows) place the operating
system in high memory, and therefore we discuss only that situation. We usually want several user processes to reside in memory at the same
time. We therefore need to consider how to allocate available memory to the
processes that are waiting to be brought into memory. In contiguous memory
allocation, each process is contained in a single section of memory that
is contiguous to the section containing the next process. Before discussing
this memory allocation scheme further, though, we must address the issue of
memory protection.
Memory Protection
We can prevent a process from accessing memory that it does not own by
combining two ideas previously discussed. If we have a system with a relocation
register (Section 9.1.3), together with a limit register (Section 9.1.1), we
accomplish our goal. The relocation register contains the value of the smallest
physical address; the limit register contains the range of logical addresses (for
example, relocation = 100040 and limit = 74600). Each logical address must
fall within the range specified by the limit register. The MMU maps the logical
address dynamically by adding the value in the relocation register. This
mapped address is sent to memory (Figure 9.6).
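In outline, the MMU's work on each reference is the check-then-add step sketched below. The function and variable names are illustrative only; the real logic is implemented in hardware.

#include <stdint.h>

static uint32_t relocation;  /* e.g., 100040: smallest physical address       */
static uint32_t limit;       /* e.g., 74600: range of legal logical addresses */

extern void trap_addressing_error(uint32_t addr);  /* assumed OS trap handler */

/* Map one logical address to a physical address, or trap on a violation. */
uint32_t mmu_map(uint32_t logical) {
    if (logical >= limit)
        trap_addressing_error(logical);  /* outside the process's space */
    return logical + relocation;         /* dynamic relocation          */
}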
When the CPU scheduler selects a process for execution, the dispatcher
loads the relocation and limit registers with the correct values as part of the
context switch. Because every address generated by a CPU is checked against
these registers, we can protect both the operating system and the other users’
programs and data from being modified by this running process.
The relocation-register scheme provides an effective way to allow the operating
system’s size to change dynamically. This flexibility is desirable in many
situations. For example, the operating system contains code and buffer space
for device drivers. If a device driver is not currently in use, it makes little
sense to keep it in memory; instead, it can be loaded into memory only when
it is needed. Likewise, when the device driver is no longer needed, it can be
removed and its memory allocated for other needs.
Memory Allocation
Now we are ready to turn to memory allocation. One of the simplest methods of
allocating memory is to assign processes to variably sized partitions in memory,
where each partition may contain exactly one process. In this variable-partition
scheme, the operating system keeps a table indicating which parts of
memory are available and which are occupied. Initially, all memory is available
for user processes and is considered one large block of available memory, a
hole. Eventually, as you will see, memory contains a set of holes of various
sizes.
Figure 9.7 depicts this scheme. Initially, the memory is fully utilized, containing
processes 5, 8, and 2. After process 8 leaves, there is one contiguous
hole. Later on, process 9 arrives and is allocated memory. Then process 5
departs, resulting in two noncontiguous holes.
As processes enter the system, the operating system takes into account the
memory requirements of each process and the amount of available memory
space in determining which processes are allocated memory. When a process
is allocated space, it is loaded into memory, where it can then compete for CPU
time. When a process terminates, it releases its memory, which the operating
system may then provide to another process.
What happens when there isn’t sufficient memory to satisfy the demands
of an arriving process? One option is to simply reject the process and provide
an appropriate error message. Alternatively, we can place such processes into
a wait queue. When memory is later released, the operating system checks the
wait queue to determine if it will satisfy the memory demands of a waiting
process.
In general, as mentioned, the memory blocks available comprise a set of
holes of various sizes scattered throughout memory. When a process arrives
and needs memory, the system searches the set for a hole that is large enough
for this process. If the hole is too large, it is split into two parts. One part is
allocated to the arriving process; the other is returned to the set of holes. When
a process terminates, it releases its block of memory, which is then placed back
in the set of holes. If the new hole is adjacent to other holes, these adjacent holes
are merged to form one larger hole.
This procedure is a particular instance of the general dynamic storage allocation
problem, which concerns how to satisfy a request of size n from a
list of free holes. There are many solutions to this problem. The first-fit, best-fit,
and worst-fit strategies are the ones most commonly used to select a free hole
from the set of available holes.
- First fit. Allocate the first hole that is big enough. Searching can start either
at the beginning of the set of holes or at the location where the previous
first-fit search ended. We can stop searching as soon as we find a free hole
that is large enough.
- Best fit. Allocate the smallest hole that is big enough. We must search the
entire list, unless the list is ordered by size. This strategy produces the
smallest leftover hole.
- Worst fit. Allocate the largest hole. Again, we must search the entire list,
unless it is sorted by size. This strategy produces the largest leftover hole,
which may be more useful than the smaller leftover hole from a best-fit
approach.
Simulations have shown that both first fit and best fit are better than worst
fit in terms of decreasing time and storage utilization. Neither first fit nor best
fit is clearly better than the other in terms of storage utilization, but first fit is
generally faster.
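The strategies can be made concrete with a few lines of C. The hole structure and helpers below are a simplified sketch under the assumption that free memory is kept as a linked list of holes; they are not taken from any particular system.

#include <stddef.h>

struct hole {                 /* one free block of contiguous memory */
    size_t start, size;
    struct hole *next;
};

/* First fit: return the first hole at least n bytes long, or NULL. */
struct hole *first_fit(struct hole *list, size_t n) {
    for (struct hole *h = list; h != NULL; h = h->next)
        if (h->size >= n)
            return h;
    return NULL;
}

/* Best fit: return the smallest hole at least n bytes long, or NULL. */
struct hole *best_fit(struct hole *list, size_t n) {
    struct hole *best = NULL;
    for (struct hole *h = list; h != NULL; h = h->next)
        if (h->size >= n && (best == NULL || h->size < best->size))
            best = h;
    return best;
}

/* Split a chosen hole: take n bytes from its start and leave the rest. */
size_t allocate_from(struct hole *h, size_t n) {
    size_t start = h->start;
    h->start += n;
    h->size  -= n;   /* a zero-size leftover would be unlinked by the caller */
    return start;
}

Worst fit differs from best_fit() only in keeping the largest qualifying hole rather than the smallest.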
Fragmentation
Both the first-fit and best-fit strategies for memory allocation suffer from external
fragmentation. As processes are loaded and removed from memory, the
free memory space is broken into little pieces. External fragmentation exists
when there is enough total memory space to satisfy a request but the available
spaces are not contiguous: storage is fragmented into a large number of small
holes. This fragmentation problem can be severe. In the worst case, we could
have a block of free (or wasted) memory between every two processes. If all
these small pieces of memory were in one big free block instead, we might be
able to run several more processes.
Whether we are using the first-fit or best-fit strategy can affect the amount
of fragmentation. (First fit is better for some systems, whereas best fit is better
for others.) Another factor is which end of a free block is allocated. (Which is
the leftover piece—the one on the top or the one on the bottom?) No matter
which algorithm is used, however, external fragmentation will be a problem.
Depending on the total amount of memory storage and the average process
size, external fragmentation may be a minor or a major problem. Statistical
analysis of first fit, for instance, reveals that, even with some optimization,
given N allocated blocks, another 0.5 N blocks will be lost to fragmentation.
That is, one-third of memory may be unusable (the 0.5 N holes out of 1.5 N total
blocks)! This property is known as the 50-percent rule.
Memory fragmentation can be internal as well as external. Consider a
multiple-partition allocation scheme with a hole of 18,464 bytes. Suppose that
the next process requests 18,462 bytes. If we allocate exactly the requested
block, we are left with a hole of 2 bytes. The overhead to keep track of this
hole will be substantially larger than the hole itself. The general approach
to avoiding this problem is to break the physical memory into fixed-sized
blocks and allocate memory in units based on block size. With this approach,
the memory allocated to a process may be slightly larger than the requested
memory. The difference between these two numbers is internal fragmentation
—unused memory that is internal to a partition. One solution to the problem of external fragmentation is compaction. The
goal is to shuffle the memory contents so as to place all free memory together
in one large block. Compaction is not always possible, however. If relocation
is static and is done at assembly or load time, compaction cannot be done. It is
possible only if relocation is dynamic and is done at execution time. If addresses
are relocated dynamically, relocation requires only moving the program and
data and then changing the base register to reflect the new base address. When
compaction is possible, we must determine its cost. The simplest compaction
algorithm is to move all processes toward one end of memory; all holes move in
the other direction, producing one large hole of available memory. This scheme
can be expensive.
Another possible solution to the external-fragmentation problem is to permit
the logical address space of processes to be noncontiguous, thus allowing a
process to be allocated physical memory wherever such memory is available.
This is the strategy used in paging, the most common memory-management
technique for computer systems. We describe paging in the following section.
Fragmentation is a general problem in computing that can occur wherever
we must manage blocks of data. We discuss the topic further in the storage
management chapters (Chapter 11 through Chapter 15).
Paging
Memory management discussed thus far has required the physical address
space of a process to be contiguous. We now introduce paging, a memory-management
scheme that permits a process’s physical address space to be noncontiguous.
Paging avoids external fragmentation and the associated need for
compaction, two problems that plague contiguous memory allocation. Because
it offers numerous advantages, paging in its various forms is used in most operating
systems, from those for large servers through those for mobile devices.
Paging is implemented through cooperation between the operating system and
the computer hardware.
The basic method for implementing paging
involves breaking physical memory
into fixed-sized blocks called frames and breaking logical memory into
blocks of the same size called pages. When a process is to be executed, its pages
are loaded into any available memory frames from their source (a file system or
the backing store). The backing store is divided into fixed-sized blocks that are
the same size as the memory frames or clusters of multiple frames. This rather
simple idea has great functionality and wide ramifications. For example, the
logical address space is now totally separate from the physical address space,
so a process can have a logical 64-bit address space even though the system has
less than 2^64 bytes of physical memory.
Every address generated by the CPU is divided into two parts: a page
number (p) and a page offset (d). The page number is used as an index into a per-process page table. This is
illustrated in Figure 9.8. The page table contains the base address of each frame
in physical memory, and the offset is the location in the frame being referenced.
Thus, the base address of the frame is combined with the page offset to define
the physical memory address. The paging model of memory is shown in Figure
9.9.
The following outlines the steps taken by the MMU to translate a logical
address generated by the CPU to a physical address:
1. Extract the page number p and use it as an index into the page table.
2. Extract the corresponding frame number f from the page table.
3. Replace the page number p in the logical address with the frame number f.
As the offset d does not change, it is not replaced, and the frame number and
offset now comprise the physical address.
The page size (like the frame size) is defined by the hardware. The size
of a page is a power of 2, typically varying between 4 KB and 1 GB per page,
depending on the computer architecture. The selection of a power of 2 as a
page size makes the translation of a logical address into a page number and
page offset particularly easy. If the size of the logical address space is 2^m, and a
page size is 2^n bytes, then the high-order m − n bits of a logical address designate
the page number, and the n low-order bits designate the page offset. Thus, the
logical address has the form [ page number p | page offset d ], where p is an index
into the page table and d is the displacement within the page.
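Because the page size is a power of 2, the split and the translation reduce to a shift, a table lookup, and a mask. The sketch below assumes a single-level page table and a 4-KB page size; the names are illustrative.

#include <stdint.h>

#define N_BITS      12u                    /* page size = 2^12 = 4 KB (assumed) */
#define OFFSET_MASK ((1u << N_BITS) - 1u)

extern uint32_t page_table[];              /* page_table[p] holds frame number f */

/* Translate a logical address to a physical address (no protection checks). */
uint32_t translate(uint32_t logical) {
    uint32_t p = logical >> N_BITS;        /* high-order m - n bits: page number */
    uint32_t d = logical & OFFSET_MASK;    /* low-order n bits: page offset      */
    uint32_t f = page_table[p];            /* frame number from the page table   */
    return (f << N_BITS) | d;              /* same offset within the frame       */
}

With N_BITS set to 2 and the page table of Figure 9.10, translate(0) returns 20 and translate(13) returns 9, matching the example that follows.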
As a concrete (although minuscule) example, consider the memory in
Figure 9.10. Here, in the logical address, n = 2 and m = 4. Using a page size
of 4 bytes and a physical memory of 32 bytes (8 frames), we show how the
programmer’s view of memory can be mapped into physical memory. Logical
address 0 is page 0, offset 0. Indexing into the page table, we find that page 0
is in frame 5. Thus, logical address 0 maps to physical address 20 [= (5 × 4) +
0]. Logical address 3 (page 0, offset 3) maps to physical address 23 [= (5 × 4) +
3]. Logical address 4 is page 1, offset 0; according to the page table, page 1 is
mapped to frame 6. Thus, logical address 4 maps to physical address 24 [= (6
× 4) + 0]. Logical address 13 maps to physical address 9.
You may have noticed that paging itself is a form of dynamic relocation.
Every logical address is bound by the paging hardware to some physical
address. Using paging is similar to using a table of base (or relocation) registers,
one for each frame of memory.
When we use a paging scheme, we have no external fragmentation: any
free frame can be allocated to a process that needs it. However, we may have
some internal fragmentation. Notice that frames are allocated as units. If the
memory requirements of a process do not happen to coincide with page boundaries,
the last frame allocated may not be completely full. For example, if page
size is 2,048 bytes, a process of 72,766 bytes will need 35 pages plus 1,086 bytes.
It will be allocated 36 frames, resulting in internal fragmentation of 2,048 −
1,086 = 962 bytes. In the worst case, a process would need n pages plus 1 byte.
It would be allocated n + 1 frames, resulting in internal fragmentation of almost
an entire frame. If process size is independent of page size, we expect internal fragmentation
to average one-half page per process. This consideration suggests that
small page sizes are desirable. However, overhead is involved in each page-table
entry, and this overhead is reduced as the size of the pages increases. Also,
disk I/O is more efficient when the amount of data being transferred is larger
(Chapter 11). Generally, page sizes have grown over time as processes, data
sets, and main memory have become larger. Today, pages are typically either
4 KB or 8 KB in size, and some systems support even larger page sizes. Some
CPUs and operating systems even support multiple page sizes. For instance, on
x86-64 systems, Windows 10 supports page sizes of 4 KB and 2 MB. Linux also
supports two page sizes: a default page size (typically 4 KB) and an architecture-dependent
larger page size called huge pages.
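The internal-fragmentation arithmetic from the earlier example can be checked mechanically; the helper below is purely illustrative.

#include <stdio.h>

/* Frames needed and internal fragmentation for a process of the given size. */
static void paging_cost(unsigned long bytes, unsigned long page_size) {
    unsigned long frames   = (bytes + page_size - 1) / page_size;  /* round up    */
    unsigned long internal = frames * page_size - bytes;           /* unused tail */
    printf("%lu bytes with %lu-byte pages: %lu frames, %lu bytes wasted\n",
           bytes, page_size, frames, internal);
}

int main(void) {
    paging_cost(72766, 2048);   /* prints 36 frames, 962 bytes, as in the text */
    return 0;
}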
Frequently, on a 32-bit CPU, each page-table entry is 4 bytes long, but that
size can vary as well. A 32-bit entry can point to one of 2^32 physical page
frames. If the frame size is 4 KB (2^12), then a system with 4-byte entries can
address 2^44 bytes (or 16 TB) of physical memory. We should note here that the
size of physical memory in a paged memory system is typically different from
the maximum logical size of a process. As we further explore paging, we will introduce other information that must be kept in the page-table entries. That
information reduces the number of bits available to address page frames. Thus,
a system with 32-bit page-table entries may address less physical memory than
the possible maximum.
When a process arrives in the system to be executed, its size, expressed
in pages, is examined. Each page of the process needs one frame. Thus, if the
process requires n pages, at least n frames must be available in memory. If n
frames are available, they are allocated to this arriving process. The first page
of the process is loaded into one of the allocated frames, and the frame number
is put in the page table for this process. The next page is loaded into another
frame, its frame number is put into the page table, and so on (Figure 9.11).
An important aspect of paging is the clear separation between the programmer’s
view of memory and the actual physical memory. The programmer
views memory as one single space, containing only this one program. In fact,
the user program is scattered throughout physical memory, which also holds other programs. The difference between the programmer’s view of memory
and the actual physical memory is reconciled by the address-translation hardware.
The logical addresses are translated into physical addresses. This mapping
is hidden from the programmer and is controlled by the operating system.
Notice that the user process by definition is unable to access memory it does
not own. It has no way of addressing memory outside of its page table, and the
table includes only those pages that the process owns.
Since the operating system is managing physical memory, it must be aware
of the allocation details of physical memory—which frames are allocated,
which frames are available, how many total frames there are, and so on. This
information is generally kept in a single, system-wide data structure called
a frame table. The frame table has one entry for each physical page frame,
indicating whether the latter is free or allocated and, if it is allocated, to which
page of which process (or processes).
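One plausible shape for a frame table, shown purely as a sketch rather than any particular kernel's layout, is an array with one entry per physical frame:

#include <stdbool.h>
#include <sys/types.h>   /* pid_t */

struct frame_entry {          /* describes one physical page frame      */
    bool          free;       /* is the frame unallocated?              */
    pid_t         owner;      /* owning process, if allocated           */
    unsigned long page;       /* which page of that process maps here   */
};

extern struct frame_entry frame_table[];   /* one entry per frame, sized at boot */

/* Find a free frame, or return -1 if physical memory is full. */
long find_free_frame(unsigned long nframes) {
    for (unsigned long f = 0; f < nframes; f++)
        if (frame_table[f].free)
            return (long) f;
    return -1;
}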
In addition, the operating system must be aware that user processes operate
in user space, and all logical addresses must be mapped to produce physical
addresses. If a user makes a system call (to do I/O, for example) and provides
an address as a parameter (a buffer, for instance), that address must be mapped
to produce the correct physical address. The operating system maintains a copy
of the page table for each process, just as it maintains a copy of the instruction
counter and register contents. This copy is used to translate logical addresses to
physical addresses whenever the operating system must map a logical address
to a physical address manually. It is also used by the CPU dispatcher to define
the hardware page table when a process is to be allocated the CPU. Paging
therefore increases the context-switch time.
Hardware Support
As page tables are per-process data structures, a pointer to the page table
is stored with the other register values (like the instruction pointer) in the
process control block of each process. When the CPU scheduler selects a process
for execution, it must reload the user registers and the appropriate hardware
page-table values from the stored user page table.
The hardware implementation of the page table can be done in several
ways. In the simplest case, the page table is implemented as a set of dedicated
high-speed hardware registers, which makes the page-address translation very
efficient. However, this approach increases context-switch time, as each one of
these registers must be exchanged during a context switch.
The use of registers for the page table is satisfactory if the page table is reasonably
small (for example, 256 entries). Most contemporary CPUs, however,
support much larger page tables (for example, 2^20 entries). For these machines,
the use of fast registers to implement the page table is not feasible. Rather,
the page table is kept in main memory, and a page-table base register (PTBR)
points to the page table. Changing page tables requires changing only this one
register, substantially reducing context-switch time.
Translation Look-Aside Buffer
Although storing the page table in main memory can yield faster context
switches, it may also result in slower memory access times. Suppose we want
to access location i. We must first index into the page table, using the value in the PTBR offset by the page number for i. This task requires one memory access.
It provides us with the frame number, which is combined with the page offset
to produce the actual address. We can then access the desired place in memory.
With this scheme, two memory accesses are needed to access data (one for the
page-table entry and one for the actual data). Thus, memory access is slowed by
a factor of 2, a delay that is considered intolerable under most circumstances.
The standard solution to this problem is to use a special, small, fast-lookup
hardware cache called a translation look-aside buffer (TLB). The TLB is associative,
high-speed memory. Each entry in the TLB consists of two parts: a key
(or tag) and a value. When the associative memory is presented with an item,
the item is compared with all keys simultaneously. If the item is found, the corresponding
value field is returned. The search is fast; a TLB lookup in modern
hardware is part of the instruction pipeline, essentially adding no performance
penalty. To be able to execute the search within a pipeline step, however, the
TLB must be kept small. It is typically between 32 and 1,024 entries in size. Some
CPUs implement separate instruction and data address TLBs. That can double
the number of TLB entries available, because those lookups occur in different
pipeline steps. We can see in this development an example of the evolution of
CPU technology: systems have evolved from having no TLBs to having multiple
levels of TLBs, just as they have multiple levels of caches.
The TLB is used with page tables in the following way. The TLB contains
only a few of the page-table entries. When a logical address is generated by
the CPU, the MMU first checks if its page number is present in the TLB. If the
page number is found, its frame number is immediately available and is used
to access memory. As just mentioned, these steps are executed as part of the
instruction pipeline within the CPU, adding no performance penalty compared
with a system that does not implement paging.
If the page number is not in the TLB (known as a TLB miss), address
translation proceeds following the steps illustrated in Section 9.3.1, where a
memory reference to the page table must be made. When the frame number is
obtained, we can use it to access memory (Figure 9.12). In addition, we add the
page number and frame number to the TLB, so that they will be found quickly
on the next reference.
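Expressed in software terms, the hit/miss logic looks roughly like the sketch below. A real TLB compares all keys in parallel; the structures and the page_table_lookup() and tlb_insert() helpers are invented for illustration.

#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64                    /* typical sizes run from 32 to 1,024 */

struct tlb_entry {
    bool     valid;
    uint32_t page;                        /* key: page number    */
    uint32_t frame;                       /* value: frame number */
};

static struct tlb_entry tlb[TLB_ENTRIES];

extern uint32_t page_table_lookup(uint32_t page);           /* walk done on a miss */
extern void     tlb_insert(uint32_t page, uint32_t frame);  /* replacement policy  */

/* Return the frame number for page, consulting the TLB first. */
uint32_t lookup_frame(uint32_t page) {
    for (int i = 0; i < TLB_ENTRIES; i++)       /* hardware checks all entries at once */
        if (tlb[i].valid && tlb[i].page == page)
            return tlb[i].frame;                /* TLB hit                             */

    uint32_t frame = page_table_lookup(page);   /* TLB miss: extra memory reference    */
    tlb_insert(page, frame);                    /* remember it for the next reference  */
    return frame;
}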
If the TLB is already full of entries, an existing entry must be selected
for replacement. Replacement policies range from least recently used (LRU)
through round-robin to random. Some CPUs allow the operating system to participate
in LRU entry replacement, while others handle the matter themselves.
Furthermore, some TLBs allow certain entries to be wired down, meaning that
they cannot be removed from the TLB. Typically, TLB entries for key kernel code
are wired down.
Some TLBs store address-space identifiers (ASIDs) in each TLB entry. An
ASID uniquely identifies each process and is used to provide address-space
protection for that process. When the TLB attempts to resolve virtual page numbers,
it ensures that the ASID for the currently running process matches the
ASID associated with the virtual page. If the ASIDs do not match, the attempt
is treated as a TLB miss. In addition to providing address-space protection, an
ASID allows the TLB to contain entries for several different processes simultaneously.
If the TLB does not support separate ASIDs, then every time a new
page table is selected (for instance, with each context switch), the TLB must be
flushed (or erased) to ensure that the next executing process does not use the wrong translation information. Otherwise, the TLB could include old entries
that contain valid virtual addresses but have incorrect or invalid physical
addresses left over from the previous process.
The percentage of times that the page number of interest is found in the
TLB is called the hit ratio. An 80-percent hit ratio, for example, means that
we find the desired page number in the TLB 80 percent of the time. If it takes
10 nanoseconds to access memory, then a mapped-memory access takes 10
nanoseconds when the page number is in the TLB. If we fail to find the page
number in the TLB then we must first access memory for the page table and
frame number (10 nanoseconds) and then access the desired byte in memory
(10 nanoseconds), for a total of 20 nanoseconds. (We are assuming that a page-table
lookup takes only one memory access, but it can take more, as we shall
see.) To find the effective memory-access time, we weight each case by its
probability:
effective access time = 0.80 × 10 + 0.20 × 20
= 12 nanoseconds
In this example, we suffer a 20-percent slowdown in average memory-access
time (from 10 to 12 nanoseconds). For a 99-percent hit ratio, which is much
more realistic, we have
effective access time = 0.99 × 10 + 0.01 × 20
= 10.1 nanoseconds
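The same weighted average can be computed for any hit ratio. The small helper below assumes, as the text does, that a miss costs exactly one extra memory access.

/* Effective access time in nanoseconds for a given TLB hit ratio. */
static double effective_access_time(double hit_ratio, double mem_ns) {
    return hit_ratio * mem_ns + (1.0 - hit_ratio) * (2.0 * mem_ns);
}
/* effective_access_time(0.80, 10.0) yields 12.0; effective_access_time(0.99, 10.0) yields 10.1. */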
This increased hit rate produces only a 1 percent slowdown in access time. As noted earlier, CPUs today may provide multiple levels of TLBs. Calculating memory
access times in modern CPUs is therefore much more complicated
than shown in the example above. For instance, the Intel Core i7 CPU has a
128-entry L1 instruction TLB and a 64-entry L1 data TLB. In the case of a miss at
L1, it takes the CPU six cycles to check for the entry in the L2 512-entry TLB. A
miss in L2 means that the CPU must either walk through the page-table entries
in memory to find the associated frame address, which can take hundreds of
cycles, or interrupt to the operating system to have it do the work.
A complete performance analysis of paging overhead in such a system
would require miss-rate information about each TLB tier. We can see from the
general information above, however, that hardware features can have a significant
effect on memory performance and that operating-system improvements
(such as paging) can result in and, in turn, be affected by hardware changes
(such as TLBs). We will further explore the impact of the hit ratio on the TLB in
Chapter 10.
TLBs are a hardware feature and therefore would seem to be of little concern
to operating systems and their designers. But the designer needs to understand
the function and features of TLBs, which vary by hardware platform. For optimal
operation, an operating-system design for a given platform must implement
paging according to the platform’s TLB design. Likewise, a change in
the TLB design (for example, between different generations of Intel CPUs) may
necessitate a change in the paging implementation of the operating systems
that use it.
Protection
Memory protection in a paged environment is accomplished by protection bits
associated with each frame. Normally, these bits are kept in the page table.
One bit can define a page to be read–write or read-only. Every reference
to memory goes through the page table to find the correct frame number. At
the same time that the physical address is being computed, the protection bits
can be checked to verify that no writes are being made to a read-only page. An
attempt to write to a read-only page causes a hardware trap to the operating
system (or memory-protection violation).
We can easily expand this approach to provide a finer level of protection.
We can create hardware to provide read-only, read–write, or execute-only
protection; or, by providing separate protection bits for each kind of access, we
can allow any combination of these accesses. Illegal attempts will be trapped
to the operating system.
One additional bit is generally attached to each entry in the page table: a
valid–invalid bit. When this bit is set to valid, the associated page is in the
process’s logical address space and is thus a legal (or valid) page. When the
bit is set to invalid, the page is not in the process’s logical address space. Illegal
addresses are trapped by use of the valid–invalid bit. The operating system
sets this bit for each page to allow or disallow access to the page.
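These bits can be pictured as fields of each page-table entry. The layout and the trap routine below are illustrative only; real entry formats differ by architecture.

#include <stdbool.h>
#include <stdint.h>

struct pte {                  /* one page-table entry (assumed layout)  */
    uint32_t frame  : 20;     /* frame number                           */
    uint32_t valid  : 1;      /* page is in the process's logical space */
    uint32_t write  : 1;      /* 1 = read-write, 0 = read-only          */
    uint32_t unused : 10;
};

extern void trap_protection_violation(uint32_t page);   /* assumed OS trap */

/* Check a reference to the given page; is_write is true for a store. */
void check_access(const struct pte *pt, uint32_t page, bool is_write) {
    if (!pt[page].valid || (is_write && !pt[page].write))
        trap_protection_violation(page);   /* invalid page or write to read-only page */
}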
Suppose, for example, that in a system with a 14-bit address space (0 to
16383), we have a program that should use only addresses 0 to 10468. Given
a page size of 2 KB, we have the situation shown in Figure 9.13. Addresses in
pages 0, 1, 2, 3, 4, and 5 are mapped normally through the page table. Any
attempt to generate an address in pages 6 or 7, however, will find that the valid–invalid bit is set to invalid, and the computer will trap to the operating
system (invalid page reference).
Notice that this scheme has created a problem. Because the program
extends only to address 10468, any reference beyond that address is illegal.
However, references to page 5 are classified as valid, so accesses to addresses
up to 12287 are valid. Only the addresses from 12288 to 16383 are invalid. This
problem is a result of the 2-KB page size and reflects the internal fragmentation
of paging.
Rarely does a process use all its address range. In fact, many processes
use only a small fraction of the address space available to them. It would be
wasteful in these cases to create a page table with entries for every page in
the address range. Most of this table would be unused but would take up
valuable memory space. Some systems provide hardware, in the form of a
page-table length register (PTLR), to indicate the size of the page table. This
value is checked against every logical address to verify that the address is in
the valid range for the process. Failure of this test causes an error trap to the
operating system.
Shared Pages
An advantage of paging is the possibility of sharing common code, a consideration
that is particularly important in an environment with multiple processes.
Consider the standard C library, which provides a portion of the system call
interface for many versions of UNIX and Linux. On a typical Linux system,
most user processes require the standard C library libc. One option is to have
each process load its own copy of libc into its address space. If a system has
40 user processes, and the libc library is 2 MB, this would require 80 MB of
memory.
If the code is reentrant code, however, it can be shared, as shown in Figure
9.14. Here, we see three processes sharing the pages for the standard C library
libc. (Although the figure shows the libc library occupying four pages, in
reality, it would occupy more.) Reentrant code is non-self-modifying code: it
never changes during execution. Thus, two or more processes can execute the
same code at the same time. Each process has its own copy of registers and data
storage to hold the data for the process’s execution. The data for two different
processes will, of course, be different. Only one copy of the standard C library
need be kept in physical memory, and the page table for each user process maps
onto the same physical copy of libc. Thus, to support 40 processes, we need
only one copy of the library, and the total space now required is 2 MB instead
of 80 MB—a significant saving!
In addition to run-time libraries such as libc, other heavily used programs
can also be shared—compilers, window systems, database systems, and so on.
The shared libraries discussed in Section 9.1.5 are typically implemented with
shared pages. To be sharable, the code must be reentrant. The read-only nature
of shared code should not be left to the correctness of the code; the operating
system should enforce this property. The sharing of memory among processes on a system is similar to the
sharing of the address space of a task by threads, described in Chapter 4.
Furthermore, recall that in Chapter 3 we described shared memory as a method
of interprocess communication. Some operating systems implement shared
memory using shared pages.
Organizing memory according to pages provides numerous benefits in
addition to allowing several processes to share the same physical pages. We
cover several other benefits in Chapter 10.