Memory Management Flashcards
mono-programming
no memory abstraction
Naive approach: Move each program to a dedicated region
Relocation problem
base register
performs dynamic relocation (added to every address at runtime)
limit register
enforces protection
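A minimal sketch of the base/limit check an MMU performs; the names are illustrative:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical MMU step with base and limit registers:
 * every virtual address is checked against limit (protection),
 * then base is added (dynamic relocation). */
typedef struct { uint32_t base, limit; } mmu_regs;

bool translate(const mmu_regs *r, uint32_t vaddr, uint32_t *paddr) {
    if (vaddr >= r->limit)      /* out of bounds -> protection fault */
        return false;
    *paddr = r->base + vaddr;   /* relocation: physical = base + virtual */
    return true;
}
```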
swapping issue
may lead to memory fragmentation (empty chunks too small to be used) → fixed by memory compaction
problem: allowing extra room for process growth
issue: how much room?
memory usage vs. OOM risk
What to do on OOMs?
kill process
relocate process
swap out
bitmap memory allocation
finding a hole of a given length requires a slow linear scan of the bitmap
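An illustrative sketch of that linear scan (one bool per allocation unit for clarity):

```c
#include <stddef.h>
#include <stdbool.h>

/* Find `need` consecutive free allocation units in a bitmap
 * (used[i] == false means unit i is free). The scan is O(n),
 * which is why bitmap allocation is slow for large memories. */
ptrdiff_t find_hole(const bool *used, size_t units, size_t need) {
    size_t run = 0;
    for (size_t i = 0; i < units; i++) {
        run = used[i] ? 0 : run + 1;
        if (run == need)
            return (ptrdiff_t)(i - need + 1);  /* start of the hole */
    }
    return -1;  /* no hole of sufficient length */
}
```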
list memory allocation
slow allocation, slow deallocation
holes sorted by address for fast coalescing
linked list memory management
- In practice, a doubly linked list is often used ⇒ makes freeing easy
- Can easily check whether the previous chunk is free
- Can easily adjust the pointers
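A minimal sketch of such a doubly linked chunk list and coalescing on free; the field names are illustrative:

```c
#include <stdbool.h>
#include <stddef.h>

/* On free(), the prev/next pointers make it cheap to test whether
 * the neighbours are free and merge (coalesce) with them. */
struct chunk {
    size_t size;
    bool   free;
    struct chunk *prev, *next;
};

void release(struct chunk *c) {
    c->free = true;
    /* coalesce with the next neighbour if it is free */
    if (c->next && c->next->free) {
        c->size += c->next->size;
        c->next = c->next->next;
        if (c->next) c->next->prev = c;
    }
    /* coalesce with the previous neighbour if it is free */
    if (c->prev && c->prev->free) {
        c->prev->size += c->size;
        c->prev->next = c->next;
        if (c->next) c->next->prev = c->prev;
    }
}
```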
first fit
take first fitting hole
next fit
take next fitting hole
performs slightly worse than first fit in practice
best fit
take best fitting hole
prone to fragmentation
worst fit
take the worst (largest) fitting hole → maximises the leftover remainder
poor performance in practice
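For comparison, a first-fit search over the chunk list sketched above (best fit would scan the whole list for the smallest sufficient hole; next fit would resume from where the previous search stopped):

```c
#include <stddef.h>

/* First fit, using struct chunk from the sketch above: return the
 * first free hole that is large enough. */
struct chunk *first_fit(struct chunk *head, size_t need) {
    for (struct chunk *c = head; c; c = c->next)
        if (c->free && c->size >= need)
            return c;
    return NULL;
}
```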
quick fit
keep separate lists of holes for commonly requested sizes
poor coalescing performance
buddy allocation scheme
improves quick fit’s coalescing performance
buddy allocator
Fast and limits external fragmentation but considerable internal fragmentation
Often combined with other allocators
E.g., the slab allocator gets memory from the buddy allocator and parcels it out to callers of kmalloc()
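A sketch of the size rounding that causes the buddy allocator's internal fragmentation (4 KB pages assumed):

```c
#include <stddef.h>

#define PAGE_SIZE 4096u

/* A buddy allocator rounds each request up to the next power-of-two
 * block: order 0 = one 4 KB page, order 1 = 8 KB, order 2 = 16 KB...
 * A 5 KB request therefore occupies 8 KB (internal fragmentation). */
unsigned buddy_order(size_t bytes) {
    unsigned order = 0;
    size_t block = PAGE_SIZE;
    while (block < bytes) {   /* double until the block fits the request */
        block <<= 1;
        order++;
    }
    return order;             /* allocation uses 2^order pages */
}
```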
slab allocator
- Allocation + freeing of objects of the same type: very common in OS
- Speed is of the essence
- Unfortunately, allocating, initializing, and destroying objects all take time. E.g., with a list of free chunks, allocation requires traversing the list to find a chunk of sufficient size
- Goal:
- make metadata for allocator (e.g., to track what memory is free) very fast
- cache as much as possible
SLABs
- slab allocator supports memory allocation for kernel
- kernel needs many different temporary objects: e.g. dentry
- temporary kernel objects
- can be both small and large
- are often allocated and often freed (efficiency is crucial)
- the buddy allocator, which operates with areas composed of entire frames of memory, is not suitable for this
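A hedged sketch of how a kernel subsystem might use the Linux kmem_cache_* API; the cache name and functions around it are illustrative, with struct dentry standing in for any frequently allocated object type:

```c
#include <linux/slab.h>
#include <linux/dcache.h>   /* struct dentry, used here only as an example */
#include <linux/errno.h>

static struct kmem_cache *my_cache;   /* one cache per object type */

static int my_init(void)
{
    /* objects come back from a per-type free list; freeing is cheap */
    my_cache = kmem_cache_create("my_dentry_cache", sizeof(struct dentry),
                                 0, SLAB_HWCACHE_ALIGN, NULL);
    return my_cache ? 0 : -ENOMEM;
}

static void my_use(void)
{
    struct dentry *d = kmem_cache_alloc(my_cache, GFP_KERNEL);
    if (d) {
        /* ... use the object ... */
        kmem_cache_free(my_cache, d);
    }
}
```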
paging
- Divide physical and virtual memory into pages of fixed size (e.g. 4096 bytes)
- Translate virtual pages into physical pages (frames)
total physical memory size
page frame size × number of frames
32 bit systems
- 32-bit address space and 32-bit page table entries (PTEs)
- 20 bits for the page frame number + 12 bits to encode other things (flags)
64-bit systems
- really: 48-bit or 57-bit address space and 64-bit PTEs
- 2^48 bytes (256 TB) addressable
two level page table (32 bit)
20 bits ⇒ split into two 10-bit parts: the first 10 bits index the top-level page table, the second 10 bits index the second-level page table; the remaining 12 bits are the offset within the page
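A minimal sketch of this 10/10/12 split in C:

```c
#include <stdint.h>
#include <stdio.h>

/* Split a 32-bit virtual address into
 * 10-bit directory index | 10-bit table index | 12-bit offset. */
int main(void) {
    uint32_t vaddr  = 0x12345678;
    uint32_t dir    = vaddr >> 22;             /* bits 31..22: top-level index */
    uint32_t table  = (vaddr >> 12) & 0x3FF;   /* bits 21..12: second-level index */
    uint32_t offset = vaddr & 0xFFF;           /* bits 11..0 : offset in page */
    printf("dir=%u table=%u offset=%u\n", dir, table, offset);
    return 0;
}
```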
four level page table
page global directory → page upper directory → page midlevel directory → pte → page
each pte points to the base of the next level → 9 bits give an offset
5 memory accesses needed for each memory access: one for each of the four table levels plus the access to the page itself
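An illustrative (much simplified) walk over the four levels, with 9 index bits per level and a 12-bit page offset; PTE flag bits and physical-to-virtual conversion are ignored for clarity:

```c
#include <stdint.h>
#include <stddef.h>

#define LEVELS     4
#define IDX_BITS   9
#define IDX_MASK   ((1u << IDX_BITS) - 1)
#define PAGE_SHIFT 12

/* One memory access per level: PGD -> PUD -> PMD -> page table,
 * then the returned PTE locates the page itself (5th access). */
uint64_t *walk(uint64_t *pgd, uint64_t vaddr) {
    uint64_t *table = pgd;
    for (int level = LEVELS - 1; level > 0; level--) {
        unsigned idx = (vaddr >> (PAGE_SHIFT + level * IDX_BITS)) & IDX_MASK;
        table = (uint64_t *)table[idx];   /* base of the next-level table */
        if (!table) return NULL;          /* not mapped -> page fault */
    }
    return &table[(vaddr >> PAGE_SHIFT) & IDX_MASK];  /* the PTE */
}
```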
inverted page tables
- one entry per physical page frame instead of one per virtual page → much smaller table for large (64-bit) address spaces
- lookup requires a search/hash on (process, virtual page) instead of direct indexing
Translation Lookaside Buffer
- Problem: MMU has to translate every single memory access → Performance problem
- Solution: Cache previous translations (virtual → physical), hoping programs exhibit a high degree of locality
How many TLBs?
- Super expensive memory: small number of entries (e.g., 64-128)
- Sometimes separate entries for huge (2MB) pages and normal (4KB) pages
- Often separate TLBs for code and data
- 1 per every CPU, but sometimes multiple levels of TLB (L1 TLB vs L2 TLB)
- separate TLBs for code and data ⇒ 2 per core
When to flush TLB?
- When entries change
- Traditionally: at context switch
- i.e., whenever the virtual-to-physical mapping changes, e.g. when switching to another process
- Improvement: tag TLB entries with the id of the process (no need to flush if no other process uses your id) → entries are replaced lazily when another process needs the slot, not eagerly at context switch
When to expect many TLB misses?
- after flushes → no entries to be found
- when the program exhibits little locality - few accesses to nearby addresses
When not to flush?
entries with the global flag set (e.g., kernel pages mapped into every address space) survive a flush
Software- vs. hardware-managed TLB
- Flexibility vs. efficiency. E.g., the OS may preload TLB entries that it expects to need later (say, the entries for a server to which the current process sends a message)
TLB miss handling
- Walk page tables to find the mapping
- If mapping found, fill new TLB entry (soft miss)
- If mapping not found, page fault (hard miss)
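A sketch of a software-managed miss handler under these assumptions; tlb_fill() and raise_page_fault() are hypothetical helpers, walk() is the page-walk sketch above, and bit 0 is assumed to be the present flag:

```c
#include <stdint.h>

extern uint64_t *walk(uint64_t *pgd, uint64_t vaddr);  /* earlier sketch */
extern void tlb_fill(uint64_t vaddr, uint64_t pte);    /* hypothetical */
extern void raise_page_fault(uint64_t vaddr);          /* hypothetical */

void tlb_miss(uint64_t *pgd, uint64_t vaddr) {
    uint64_t *pte = walk(pgd, vaddr);
    if (pte && (*pte & 1))        /* present bit set: soft miss */
        tlb_fill(vaddr, *pte);
    else                          /* no valid mapping: hard miss */
        raise_page_fault(vaddr);
}
```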
OS PF handling, cases
- Access violation (segmentation fault)
- Legal access, PF handler needs to fix up page tables:
- The page is already in memory (minor PF) OR
- The page must be fetched from disk (major PF)
Page Replacement hw support
PTE bits: modified and referenced
pr optimal
- Replace page that will be referenced as far in the future as possible
- Can this algorithm be implemented in practice? Nope (it requires knowing future references); still useful as a benchmark for other algorithms
pr random
replace a random page
NRU
- Classes of pages (R = referenced, M = modified):
Class 0: R=0, M=0
Class 1: R=0, M=1
Class 2: R=1, M=0
Class 3: R=1, M=1
- Periodically clear the R bit for all pages (set R to 0)
- Replace a page at random, but prioritise class 0, then class 1, etc.
⇒ replace the least expensive one
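A sketch of victim selection by class (class = 2·R + M); the random choice within a class is simplified to "first found":

```c
#include <stddef.h>

struct page { int r, m; };   /* referenced and modified bits */

/* NRU victim: a page from the lowest non-empty class. */
ptrdiff_t nru_victim(const struct page *pages, size_t n) {
    for (int cls = 0; cls <= 3; cls++)
        for (size_t i = 0; i < n; i++)
            if (((pages[i].r << 1) | pages[i].m) == cls)
                return (ptrdiff_t)i;
    return -1;   /* no pages */
}
```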
FIFO
build a queue of pages in the order they were faulted in and replace the head (the oldest page)
pr second chance
- Improved FIFO to preserve important pages
- For each visited page:
- If R=0 then replace page
- If R=1 then set R=0 and move page to tail
pr clock
- Hand points to oldest page
- When a page fault occurs, the page the hand is pointing to is inspected
- the action depends on the R bit:
R = 0 ⇒ evict the page
R = 1 ⇒ clear R and advance hand
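A minimal sketch of the clock hand logic:

```c
#include <stddef.h>

struct frame { int r; /* referenced bit */ };

/* Advance the hand until a page with R=0 is found; pages with R=1
 * get a second chance (R cleared, hand advances). */
size_t clock_evict(struct frame *frames, size_t n, size_t *hand) {
    for (;;) {
        struct frame *f = &frames[*hand];
        if (f->r == 0) {                   /* evict this one */
            size_t victim = *hand;
            *hand = (*hand + 1) % n;
            return victim;
        }
        f->r = 0;                          /* second chance */
        *hand = (*hand + 1) % n;
    }
}
```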
pr LRU
- Idea: Replace page that has been unused for the longest
- Good predictor for optimal in many (but not all) cases
- Can LRU ever perform worse than random? Yes, e.g., for cyclic access patterns slightly larger than memory
- HW-assisted solutions:
- Maintain an ordered page list, updated at each reference; the head is the LRU page
- Maintain reference timestamps in PTEs; the PTE with the oldest timestamp marks the LRU page
- expensive in hardware ⇒ rarely implemented in practice
pr NFU
- Maintain per-page counters; periodically add the R bit to every counter, then clear R
- Replace the page with the lowest counter
- Problem: NFU never “forgets” pages
- some pages might have been used a lot before, but are no longer being used ⇒ takes a long time to replace them
pr aging
- Improve NFU by:
- shifting the counters right by 1
- adding the R bit as the leftmost bit (instead of summing it)
- Offers lower temporal resolution than “real” LRU
- CLOCK (and variations) more often used in practice
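A sketch of one aging tick over per-page 8-bit counters:

```c
#include <stdint.h>
#include <stddef.h>

/* On every clock tick: shift each counter right, OR the R bit into
 * the most significant bit, then clear R. The page with the lowest
 * counter is the replacement candidate (approximates LRU). */
void age_pages(uint8_t *counter, int *r_bit, size_t n) {
    for (size_t i = 0; i < n; i++) {
        counter[i] >>= 1;                   /* forget the past gradually */
        if (r_bit[i])
            counter[i] |= 0x80;             /* recent use -> MSB */
        r_bit[i] = 0;                       /* clear R for the next period */
    }
}
```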
working set
w(k, t)
- informally: the set of pages a process is currently using
- precisely: the set of pages used by the k most recent memory references
- the function w(k, t) is the size of the working set at time t
If (WS in memory) →
few page faults
If (memory too small for WS) →
thrashing
thrashing
a process spends more time paging than executing
Demand paging
- Lazily allocate pages to each process
- Exploits temporal locality to avoid frequent PFs
Pre-paging
- Typically based on the working set model, i.e., the set of pages a process currently needs to execute
- Load the process working set in memory in advance when scheduling the process to avoid frequent PFs
- Need for working set estimation algorithms
bss
block started by symbol ⇒ contains zero-initialized data that doesn’t need to be stored in the binary
Local allocation
- Replaces only pages of the current process
- Assumes static process memory allocation size
- Problems:
- A growing working set leads to thrashing
- A shrinking working set leads to wasted memory
- memory utilisation is worse than with global allocation
Global allocation
- Replaces pages owned by any process. Strategies:
- Treat every global page equally (or prioritize)
- Dynamic memory balancing using working set size estimation
- Selectively swap out processes (or OOM kill)
- problems:
- a process can’t control its own page fault rate - other processes can “steal” frames
no page faults ⇒
too much memory allocated
too many page faults ⇒
too little memory allocated
Load Control (design issue)
- Thrashing can be expected when the combined working sets of all processes exceed the capacity of memory
- Out of Memory killer (OOM) selectively kills processes when system is low on memory
- A less drastic approach swaps some processes out to nonvolatile storage and releases their pages
- Similar to two-level scheduling
- Can also reduce memory usage via compaction/compression
- Deduplication or same page merging
Page Size (design issue)
- Most paging systems support 4KB pages
- relatively small: limits internal fragmentation / space wastage
- not too small: limits the number of pages to manage and the number of TLB entries needed
- In addition, many support also larger pages. For instance:
2 MB or 1 GB (huge pages)
Design Issues: Separate Instruction and Data Spaces
- Most computers have a single address space shared by program and data
- In the past, some systems had a separate address space for instructions and data
- Nowadays, we still see separate I and D spaces in caches and TLBs, etc.
Design Issues: Shared Pages
- Sharing improves efficiency.
- When one process exits, the OS should realise there is another user of the page
- Sharing data is trickier than sharing code
- After the fork system call, parent and child share program text and data
- To do so efficiently, each gets its own set of page tables, pointing to the same pages
- Pages are mapped read-only
- When a process writes to such a page:
make a copy
=> copy-on-write
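A small runnable demonstration of COW semantics after fork(): the child's write triggers a fault and a private copy, so the parent still sees the old value:

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void) {
    int x = 42;
    pid_t pid = fork();          /* parent and child now share the page holding x */
    if (pid == 0) {              /* child */
        x = 99;                  /* write -> COW fault -> private copy */
        printf("child:  x = %d\n", x);
        exit(0);
    }
    wait(NULL);
    printf("parent: x = %d\n", x);  /* still 42 */
    return 0;
}
```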
Design Issues: Shared Libraries
- Sharing libraries is very common
- easy: read-only
- Typically through Dynamic Link Libraries (DLLs)
- often as position-independent code
Design Issues: Memory mapped files
- Shared libs are a special case of memory-mapped files
- As pages are touched→ demand paged from disk
- When process exits→ write dirty pages back to disk
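A minimal sketch using POSIX mmap(): pages are faulted in on first touch and dirty pages written back (data.bin is an assumed existing, non-empty file):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.bin", O_RDWR);    /* assumed existing file */
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0) return 1;

    char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return 1;

    p[0] ^= 1;                     /* first touch demand-pages the page in */
    msync(p, st.st_size, MS_SYNC); /* flush the now-dirty page to disk */
    munmap(p, st.st_size);         /* dirty pages also written back at unmap */
    close(fd);
    return 0;
}
```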
Design Issues: Cleaning policy
- Paging works well if plenty of free pages. What if we run short?
- Indirect reclaim: Paging daemon sleeps and then evicts pages if free pages are in short supply
- Flush dirty pages
- Leave them in memory – easy reuse
Page Fault Handling
- The hardware traps to kernel, saving program counter on stack.
- Assembly code routine started to save registers and other volatile info. Calls the page fault handler
- System discovers page fault has occurred, tries to discover which virtual page needed
- Once virtual address that caused fault is known, system checks to see if address valid and the protection consistent with access
- If frame selected dirty, page is scheduled for transfer to nonvolatile storage, context switch takes place, suspending faulting process
- As soon as frame clean, operating system looks up disk address where needed page is, schedules disk or SSD operation to bring it in.
- When disk or SSD interrupt indicates page has arrived, tables updated to reflect position, and frame marked as being in normal state.
- Faulting instruction backed up to state it had when it began and program counter is reset
- Faulting process is scheduled, operating system returns to routine that called it.
- Routine reloads registers and other state information, returns to user space to continue execution where it left off
Separation of Policy and Mechanism
the separation is important for managing the complexity of the system.
Memory management system is divided into three parts
- A low-level MMU handler (machine dependent).
- A page fault handler that is part of the kernel.
- An external pager running in user space.
kernel page fault handler
- machine independent
- contains mechanism for paging
external pager
- machine independent
- can implement the policy