Final Exam Flashcards
Pthreads Sync for IPC
• Init'ed w/ PTHREAD_PROCESS_SHARED attribute:
 - pthread_mutexattr_t
 - pthread_condattr_t
• Sync info must be avail to all processes - create mutex in shared memory region
(See video #274 for code sample)
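A minimal sketch of the idea, assuming a shared memory segment named "/sync_shm" (name and size are illustrative): the mutex lives in the mapped region and is initialized with the process-shared attribute.

```c
// Minimal sketch: a pthread mutex placed in shared memory so multiple
// processes can use it; "/sync_shm" is an example segment name.
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = shm_open("/sync_shm", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(pthread_mutex_t));
    pthread_mutex_t *m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);

    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED); /* usable across processes */
    pthread_mutex_init(m, &attr);        /* the mutex itself lives in the shmem region */

    pthread_mutex_lock(m);
    /* ... critical section shared with other processes mapping "/sync_shm" ... */
    pthread_mutex_unlock(m);
    return 0;
}
```

Condition variables work the same way via pthread_condattr_setpshared().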
Copy (Messages) vs Shared Memory
Copy (Messages)
• CPU cycles to copy data to/from port
• Better for smaller messages
Map (Shmem)
• CPU cycles to map memory into addr space, then to copy data into channel
• Map set up once, used many times, or one large block moved once -> good payoff
Freeing Physical Memory
• When should pages be swapped?
Memory usage above threshold (high watermark)
CPU usage below threshold (low watermark)
• Which pages should be swapped?
History based prediction - Least Recently Used (LRU)
Pages without dirty bit set
Avoid non-swappable pages
Categorize pages into diff types
Second chance variation of LRU - make 2 passes
before deciding which page to swap
Memory Allocation
Memory Allocator
• Determines VA->PA mapping, checks validity/perm
Kernel Level Allocators
• Allocate pages for kernel state, processes states
• Keep track of free memory
User-level allocators
• Dynamic process state (heap) using malloc/free
• Once malloc’d, kernel not involved. Examples: dlmalloc, jemalloc, Hoard, tcmalloc
Homogeneous Architectures
• Each node can do any processing step
+ Keeps front end simple - nodes don’t all hold all data, but each can get to all data
- How to benefit from caching?
• Scale by adding more nodes
Priority Scheduling
- Tasks have different priority levels
- Run highest priority task next
- Can preempt lower priority tasks
Copy-on-Write
If new process is created, why copy all the memory?
• Instead, map new VA to original page
• Write protect original page
• If only read, this will save memory & time
• On write, page fault and copy/update page tables
• Only copy pages on write => Copy-on-Write
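A small user-level demo of the effect (not kernel code, just observing the semantics): after fork() the child's write faults and gets its own copy, so the parent's data is unchanged.

```c
// Illustrative demo of copy-on-write semantics after fork(); the kernel
// copies a shared page only when one side writes to it.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    char *buf = malloc(4096);
    strcpy(buf, "original");

    pid_t pid = fork();               /* child initially shares the parent's pages */
    if (pid == 0) {
        strcpy(buf, "modified");      /* write faults; kernel copies the page for the child */
        printf("child sees:  %s\n", buf);
        exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent sees: %s\n", buf); /* still "original" */
    free(buf);
    return 0;
}
```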
Split Device Driver Model
• Approach: device access ctrl split btwn
front end driver in guest VM (device API)
back-end driver in service VM (or host)
• Modified guest drivers
• Limited to paravirtualized guests
+ Eliminates emul OHs
+ Allow better mgmt of shared devices
Internet Service
• Any service avail through web interface (mail, weather, banking, etc)
3 components:
• Presentation - static content
• Biz logic - dynamic content
• Database tier - data store
Problems with Trap & Emulate
• x86 pre 2005:
4 rings, no root/non-root modes yet
HV = ring 0, guest OS = ring 1
• 17 instructions do not trap, fail silently
• e.g. interrupt enable/disable via POPF/PUSHF
• HV doesn’t know there was an error, assumes successful
Strict consistency
• Updates are visible everywhere immediately
• In practice, even on SMP there are no guarantees
• Even harder on DSM
• Doesn’t really exist
More Sync Constructs (All need hardware support)
- Serializers - Good w/priorities, hides need for signalling, condition vars
- Path expressions - uses reg exp to define access
- Barriers - Reverse semaphore
- Rendezvous Points - Waits for mult threads to reach point
- Optimistic Wait Free Sync (RCU) - Bet on no conflict
What are the overheads associated with scheduling? Do you understand the tradeoffs associated with the frequency of preemption and scheduling/what types of workloads benefit from frequent vs. infrequent intervention of the scheduler (short vs. long timeslices)?
- Longer timeslice keeps CPU util & throughput high, benefits CPU-bound tasks (fewer cxt switches)
- Short timeslice issues I/O ops sooner, better CPU and device utilization, and better user-perceived performance.
Device Drivers
- Driver needed for each type of hardware
- Resp for device access, mgmt, control
- Provided by device manufacturer
- OS standardizes interfaces to drivers
- Device independence & diversity
Summary of Page based DSM
- Each node contribs some mm
- Needs local caches for perf (latency)
- All nodes responsible for part of dist mem
- Home node mngs access & tracks page owner
- Explicit replication - possible for load balance, perf, reliability
Performance Considerations
DSM perf metric == access latency
How do we make the most of local memory?
Migration
• Makes sense for SRSW
• Requires data movement
Replication (Caching)
• More general, requires consistency mgmt
• Best for low latency
CPU - Device Interconnect
- Peripheral Component Interconnect (PCI)
- PCIX - PCI Extended
- PCI Express (PCIe) - Faster, low latency, more devs
- SCSI - Connects disks
- Bridge = converter from one bus to another
- Bridge/mem controller routes req to I/O = Mem map I/O
- I/O Port model
- Interrupts: + immediate, - stops current exec
- Polling: + when ok for OS, - delay or CPU overhead
inode example
(Quiz on video #354 & 355)
Memory Virtualization Full
• Full virtualization
• Guest OS expects contig phys mem start @ 0
• Virtual (VM), Physical(Guest thinks phys), Machine (Actual phys mem)
Option 1:
• Guest page table: VA => PA, HypVisor: PA => MA
• Still uses TLB, MMU for PA=>MA, software VA=>PA
Option 2:
• Guest page tables VA=>PA
• Hypervisor shadow PT: VA=>MA
• Hypervisor maintains consistency
Hosted Virtualization - Type 2
- Host OS in place of hypervisor
- Host OS has device drivers / VMM module
- Guest VMs run on VMM module
- Native apps run directly on OS
- Example: KVM - based on Linux. Guest VMs run on KVM & QEMU for hardware virt.
- Ex: Fusion, VirtualBox, VMPlayer
Virtual File System
- Abstracts location of file sys - local or remote fs
- Apps still use kernel i.e. POSIX API
- OS abstracts drivers & devices
- OS maintains persistent index of file blocks, size, permissions, etc = inode
- Dentry - directory of paths to each file in OS mem/cache. ./users/ada => /, /users, /users/ada
- Superblock abstraction - FS specific info regarding FS layout incl inode, other metadata
Demand Paging
• Virtual page memory not always in physical memory
• Phys page frame saved & restored to/from 2ndary storage
Demand Paging
• Pages swapped in/out of mem & swap partition
• Page can be ‘pinned’ - disables swapping
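One user-level way to see pinning is mlock(2), which locks a range into RAM so it can't be swapped (a hedged sketch; the buffer size is arbitrary).

```c
// Hedged sketch: pinning a buffer with mlock(2) so its pages can't be swapped out.
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 1 << 20;            /* 1 MiB buffer (arbitrary size) */
    void *buf = malloc(len);
    if (mlock(buf, len) != 0) {      /* lock pages into physical memory */
        perror("mlock");
        return 1;
    }
    /* ... latency-sensitive work on buf; its pages stay resident ... */
    munlock(buf, len);               /* re-enable swapping for these pages */
    free(buf);
    return 0;
}
```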
POSIX Shared Memory
- Not as widely supported as SysV
- Uses ‘files’ that exist in tmpfs
- shm_open() - opens ‘file descriptor’
- mmap() / munmap() - attach / detach
- shm_close()
- shm_unlink()
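A minimal sketch of these calls in order, assuming a segment name "/my_shm" and a 4 KiB size (both illustrative); the 'close' step here is an ordinary close() on the descriptor.

```c
// Minimal sketch of the POSIX shm calls above; "/my_shm" and 4096 are example values.
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = shm_open("/my_shm", O_CREAT | O_RDWR, 0600);   /* tmpfs-backed "file" */
    ftruncate(fd, 4096);                                      /* size the segment */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);                        /* attach */
    strcpy(p, "hello from shm");
    munmap(p, 4096);                                          /* detach */
    close(fd);                                                /* close the descriptor */
    shm_unlink("/my_shm");                                    /* remove when done */
    return 0;
}
```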
Shared Memory Design Considerations
- One large segment
- Pool of small segments -> use queue of seg ids
- What if data > seg size? Transfer data in rounds (See Project 3!)
Device Types
- Block devices - e.g. disks - r/w blocks of data
- Character - e.g. keyboard - get/put char
- Network devices - btwn block & character - stream of data chunks
- OS standardizes by type
- OS represents devices as special device files
- UNIX like systems - /dev, tmpfs, devfs
Disk Access Optimizations
- Caching / buffering - reduce # of disk accesses
- Buffer cache in main memory
- Read/write from cache
- Periodically flush to disk (use fsync)
- Reduce head mvmt w/ I/O scheduler
- Prefetching - leverages locality, e.g. read block 17, also read 18, 19
- Journaling / Logging - reduce rand access. Writes updates in log with blk, offset, value. Periodically writes to proper location.
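Tiny hedged example of the buffer-cache point above: write() lands in the in-memory cache and fsync() forces it to disk ("data.log" is just an example file).

```c
// Hedged sketch: write() fills the buffer cache; fsync() flushes it to disk.
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.log", O_CREAT | O_WRONLY | O_APPEND, 0644); /* example file */
    write(fd, "record\n", 7);   /* buffered in the in-memory buffer cache */
    fsync(fd);                  /* force file data (and metadata) out to disk */
    close(fd);
    return 0;
}
```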
Inverted Page Tables
- Process ID and page => index to physical mem
* Supplemented with hashed page tables
Memory Virtualization Paravirtualized
- OS knows it’s in virt environ.
- No need for contig physical mem
- Reg pages with HypVisor.
- Can batch page table updates to reduce VM exits
- Other optimizations
- OHs reduced or elim on newer platforms
Hardware Virtualization
Key Virt-related hardware features - x86
• AMD Pacifica & Intel Vanderpool technology
• Close 17 bad instructions
• New protection modes - root/non-root = host/guest
• VM Control struct per V(M)CPU, ‘walked by’ hardware
• Extended Page Tables & tagged TLB w/ VM ids
• Multiqueue devices & interrupt routing
• Security & mgmt support
• Add’l instructions for above features
Hardware vs Software DSM
• Hardware - relies on high-end interconnect - $$$
OS thinks it has larger physical memory when memory is really on other nodes
NICs translate remote mm accesses to msgs
NICs involved in all aspects of mm mgmt
• Software
Everything done by software, OS or language runtime
Bare Metal Virtualizers - Hypervisors (VMM) - Type 1
- Bare metal / Hypervisor (VMM) - Privileged VM manages hardware resources & supports VMs
- Xen (Open Source or Citrix) - VMs are called domains. Priv domain, has drivers = dom 0
- ESX (VMWare) - Drivers in VMM, remote APIs
- Microsoft Hypervisor
When do tasks enter the ready queue?
- Creation (fork)
- Interrupt
- I/O completion
- Expiration of time slice
Page Table Entry
Flags
• Valid bit (Present bit) - indicates validity of page
• Dirty bit - set when page written to
• Accessed bit - has page been read or written to
• Protection/Permission bits - R, W, X
• U/S - User mode / Supervisor mode only
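Purely illustrative sketch of how these flags might sit in a 64-bit PTE; real entry formats are architecture-specific (e.g. x86 stores an execute-disable bit rather than X).

```c
// Illustrative (hypothetical) PTE layout showing the flag bits above;
// real page table entry formats are architecture-specific.
#include <stdint.h>

struct pte {
    uint64_t present  : 1;   /* valid/present bit */
    uint64_t writable : 1;   /* W permission */
    uint64_t user     : 1;   /* U/S: user vs supervisor-only */
    uint64_t accessed : 1;   /* set when page is read or written */
    uint64_t dirty    : 1;   /* set when page is written */
    uint64_t no_exec  : 1;   /* X permission (inverted on real x86) */
    uint64_t pfn      : 40;  /* physical frame number */
    uint64_t reserved : 18;
};
```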
File Access Operations in Sprite
(See Sprite paper)
Virtualization requirements
- Present virtual platform interface to VMs
- Provide isolation across VMs
- Protect guest OS from apps
- Protect VMM from guest OS
/dev types
hda, sda, tty, null, zero, ppp, lp, mem, console, autoconf, etc.
Atomic Instructions
- Hardware specific: Test & Set, Read & Increment, Compare & Swap
- Guarantees: Atomicity, mut excl, queue all concurrent instructions but one
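A minimal sketch of how one of these primitives is used: a spinlock built on a test-and-set style operation, using the GCC/Clang __sync builtins to stand in for the raw hardware instruction.

```c
// Minimal sketch: spinlock built from a test-and-set style atomic,
// using the GCC/Clang __sync builtins in place of the hardware instruction.
#include <stdio.h>

static volatile int lock = 0;               /* 0 = free, 1 = held */

static void spin_lock(volatile int *l) {
    while (__sync_lock_test_and_set(l, 1))  /* atomically set to 1, return old value */
        ;                                   /* spin while it was already held */
}

static void spin_unlock(volatile int *l) {
    __sync_lock_release(l);                 /* atomic store of 0 with release semantics */
}

int main(void) {
    spin_lock(&lock);
    puts("in critical section");
    spin_unlock(&lock);
    return 0;
}
```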
Passthrough Model of Device Virtualization
• Approach: VM level driver configs device access permissions directly
• Has exclusive access to device, bypass HV
• Also called VMM bypass model
- Device sharing difficult
- Guest VM must have exact type of driver for device
- VM migration tricky
Why does cloud computing work?
- Law of Large Numbers - variation in resource needs per client on average over time is constant
- Economies of scale
Issues with Mutexes & Condition Vars
• Limitation of mutexes
- error prone; affects correctness & ease of use
- lack of expressive power
- Require low level hardware support - atomic instructions
DSM Design - Consistency Mgmt
- DSM is similar to Shmem Multiprocessors (SMP)
- In SMP: W-I & W-U, coherence triggered on each write => OH too $$$
- Push invalidations when data written (active/eager/pessimistic)
- Pull modification info periodically (reactive/lazy/optimistic)
- When triggered depends on consistency model
Message Queue System calls
- msgsnd - Send msg to queue
- msgrcv - Receive msg
- msgctl - Perform control op
- msgget - Get msg identifier
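A minimal sketch of these four calls in order (key, message type, and sizes are arbitrary example values).

```c
// Minimal sketch of the SysV message queue calls above; key/type/sizes are examples.
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>

struct msgbuf { long mtype; char mtext[64]; };

int main(void) {
    int qid = msgget(IPC_PRIVATE, 0600 | IPC_CREAT);  /* get msg queue identifier */

    struct msgbuf out = { .mtype = 1 };
    strcpy(out.mtext, "hello");
    msgsnd(qid, &out, sizeof(out.mtext), 0);          /* send msg to queue */

    struct msgbuf in;
    msgrcv(qid, &in, sizeof(in.mtext), 1, 0);         /* receive msg of type 1 */
    printf("got: %s\n", in.mtext);

    msgctl(qid, IPC_RMID, NULL);                      /* control op: remove the queue */
    return 0;
}
```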
What is consistency model?
• Agreement between mem (state) and upper software layers
• Mem guarantees to behave correctly:
 - Access ordering
 - Propagation / visibility of updates
What is Virtualization?
- Virtualization allows concurrent execution of multiple OSs (and their apps) on the same physical machine.
- Virtual resources - each OS thinks that it “owns” hardware resources
- VM - OS + apps + virtual resources
- Virt layer - mgmt of phys hardware (hypervisor, VM monitor) incl isolation
Cloud Computing & Animoto
- Poster child of cloud computing
- Amazon added hardware for xmas, idle rest of yr
- Opened access to resources via web based API (2006) - Birth of AWS and EC2
- Went viral - 750k new users - 50 instances -> 3400 in one week
Device Access - Programmed I/O (PIO)
- Control via command registers
- Data r/w via registers
- e.g. for 1500B packet w/ 8-byte registers or bus = 1 command register access + 188 data accesses
Processor Virtualization
• Guest instructions exec directly by hardware
• For non-priv ops, hardware speed = efficiency
• Priv-ops are trapped to hypervisor:
Illegal: terminate VM
Legal: emulate behavior VM OS was expecting
Sprite - Analysis to Design
- Cache with write-back - write-back blocks not modified within last 30 sec
- Client wants to open file being currently written - get dirty blocks
- Open goes to server, dirs not cached
- On concurrent write -> disable caching
- Sequential write sharing - caching & seq semantics
OS Bypass
- Not necessary to go through kernel to get to device
- OS bypass - user accesses device directly
- Device regs/data directly accessible
- Needs user level driver
- OS retains coarse grain control
- Device needs user-level driver/library
- Demux capabilities to send results to correct process
Linux O(1)
- Constant time to select/add task, regardless of task count
- Pre-emptive, priority based
- Diff. real-time tasks vs other
- Different timeslice values for diff priorities
- Feedback - uses history to boost or lower priority
- Runqueue = 2 arrays of tasks - active & expired
- Replaced in Linux 2.6.23 by CFS scheduler
Determine CPU bound vs Memory bound
- Use heuristic based on history
- Sleep time won’t work - thread isn’t sleeping when waiting for mem, and computing it takes too much time
- Need hardware support - hardware counters - cache use, power use, etc
- Linux perf tool, oprofile used to access counters
Device Access - Direct Memory Access (DMA)
- Uses DMA controller
- CPU programs device via cmd registers and via DMA controls
- DMA controller config with in-memory addr and size of packet buffer
- e.g. for 1500B packet: 1 command + 1 DMA config = 2 instructions, but the DMA setup step is more complex
- Memory must be present in physical mem while write is happening - memory must be pinned
Run to Completion Scheduling
First-Come First-Serve (FCFS)
- Tasks in order of arrival | runqueue == FIFO queue
if T1 = 1s | T2 = 10s | T3 = 1s:
Throughput - 3/12s = 0.25 tasks/s
Avg job completion time (1+11+12) / 3 = 8 sec
Avg job waiting time - (0+1+11) / 3 = 4 sec
Cloud Service Models
- On premise - you manage all
- Infrastructure as a service (IaaS) - Amazon EC2 - You manage OS -> Apps
- Platform as a service (PaaS) - Google App Engine - You manage app & data only
- Software as a Service (SaaS) - e.g. Gmail - you manage nothing
VFS on Disk
- VFS data structs are software entities maintained by OS FS component
- file => data blocks on disk
- inode => track files’ blocks - on disk in a block!
- Superblock has map of all blocks on device - which are inode, data, free, etc
Why has virtualization become popular only recently?
- Mainframes were not ubiquitous
- Other hardware was cheap
- Then, realized servers were underutilized
- Datacenters getting too large
- Needed more & more sys admins
- Escalating power bills
Cloud Computing Requirements
• Determine capacity on expected (peak) demand
• When demand exceeds capacity
- dropped requests => lost opportunity
• Ideal
Fine grained pricing | professionally mngd | API access
+ Scales elastically with demand, instantly & automatically
- You don’t own the resources
Overhead of address translation
- Single-level PT: 1 for PT, 1 for mem = 2 accesses
- Four level PT: 4 for PTs, 1 for mem = 5 accesses
- Page Table Cache - Trans. Lookaside Buffer (TLB)
- TLB has all flag bits
- small # of cached addrs => high TLB hit rate due to temporal & spatial locality
Typical Device Access
- Process needs to access disk or NIC
- Process does system call
- OS runs in-kernel stack to form packet
- OS invokes appropriate driver
- Device driver configures device request
- Device performs op
- Results traverse this call chain in reverse
Cache Coherence - NCC / CC / WU / WI
- Non-cache coherent (NCC) - writing to one cache does not update other caches w/same memory addr
- Cache Coherent (CC) - Hardware updates all caches
- Write-invalidate (WI) - If one cache is changed, all others are invalidated | + lower bandwidth used
- Write-update (WU) - Hardware updates all caches w/addr | + Data avail right away
- WI/WU determined by hardware
Weak consistency
• Sync points - ops avail (r/w/sync) to upper levels
• All updates prior to a sync point will be visible
• No guar what happens in between
• Sync must be called by both writer & reader
• Variations
Single sync operation
Sep sync per subset of state (page)
Sep “entry/acquire” vs “exit/release” ops
+ Limit data movement & coherence ops
- maintain extra state for addl ops
Shortest Job First (SJF) + Preemption
- Use heuristic based on history to estimate time
* Windowed average - how long did the task run the last n times?
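A minimal sketch of the windowed-average idea (function name and window size are illustrative, not from the course).

```c
// Minimal sketch: estimate next runtime as the average of the last WINDOW runs.
// Function name and window size are illustrative.
#define WINDOW 8

double estimate_runtime(const double history[], int count) {
    int n = count < WINDOW ? count : WINDOW;  /* use at most the last WINDOW runtimes */
    if (n == 0)
        return 0.0;
    double sum = 0.0;
    for (int i = count - n; i < count; i++)
        sum += history[i];
    return sum / n;                           /* windowed average */
}
```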
Sync vs Async
- What happens to calling thread after I/O call?
- Depends if operation is sync or async
- Sync - process blocks and waits for I/O to complete
- Async - process continues, then checks or is notified by device or OS that oper is done (poll vs interupt)
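One hedged illustration of the async case using POSIX AIO: the call returns immediately and the process polls for completion ("data.log" is an example file; some systems need -lrt at link time).

```c
// Hedged sketch: POSIX AIO read issued without blocking; caller polls for completion.
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    char buf[512];
    int fd = open("data.log", O_RDONLY);
    struct aiocb cb = { .aio_fildes = fd, .aio_buf = buf,
                        .aio_nbytes = sizeof(buf), .aio_offset = 0 };

    aio_read(&cb);                           /* issue the read; do not block */
    while (aio_error(&cb) == EINPROGRESS)    /* poll: is the operation done yet? */
        ;                                    /* could do useful work here instead */

    ssize_t n = aio_return(&cb);             /* fetch the result once complete */
    printf("read %zd bytes\n", n);
    close(fd);
    return 0;
}
```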
What types of app access algos for DSM?
- Single reader / single writer (SRSW)
- Mult readers / single writer (MRSW)
- Mult readers / mult writers (MRMW)
Threads and Simultaneous Multithreading (SMT)
- Co-sched CPU-bound tds -> degrades perf 2x, mem idle
- Co-sched Mem-bound tds -> Idle CPU cycles
- Mixing CPU/Mem tds -> BINGO! Avoid contention for processor pipeline, CPU/Mem well utilized
Page Table
- Maps pages in VM to page frames in DRAM
- Pages and page frames are same size
- Virt Addr split into Virt Page No (VPN) & Offset
- Physical Frame Num (PFN) indexes phys page frame
- Unused pages are reclaimed
- Each process has its own page table pointed to by CPU register (x86 it’s called CR3)
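A minimal sketch of the VPN/offset split for 4 KiB pages; the flat page_table[] lookup is purely illustrative (real tables are hierarchical).

```c
// Minimal sketch: splitting a virtual address into VPN + offset (4 KiB pages).
// The flat page_table[] lookup is illustrative; real tables are multi-level.
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

uint64_t translate(uint64_t va, const uint64_t *page_table) {
    uint64_t vpn    = va >> PAGE_SHIFT;       /* virtual page number indexes the table */
    uint64_t offset = va & (PAGE_SIZE - 1);   /* page offset carried over unchanged */
    uint64_t pfn    = page_table[vpn];        /* physical frame number */
    return (pfn << PAGE_SHIFT) | offset;      /* physical address */
}
```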