Final Exam Flashcards
Pthreads Sync for IPC
• Init'ed w/ PTHREAD_PROCESS_SHARED attribute:
  - pthread_mutexattr_t
  - pthread_condattr_t
• Sync info must be avail to all processes
  - create mutex in shared memory region
(See video #274 for code sample)
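A minimal sketch, assuming shm points to a struct placed in an already-mapped shared memory region (struct name and mapping are illustrative):
#include <pthread.h>
struct shm_data { pthread_mutex_t mutex; pthread_cond_t cond; };
struct shm_data *shm = NULL;               /* set to addr returned by mmap()/shmat() */

pthread_mutexattr_t mattr;
pthread_mutexattr_init(&mattr);
pthread_mutexattr_setpshared(&mattr, PTHREAD_PROCESS_SHARED);
pthread_mutex_init(&shm->mutex, &mattr);   /* mutex lives inside the shared segment */

pthread_condattr_t cattr;
pthread_condattr_init(&cattr);
pthread_condattr_setpshared(&cattr, PTHREAD_PROCESS_SHARED);
pthread_cond_init(&shm->cond, &cattr);     /* cond var shared the same way */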
Copy (Messages) vs Shared Memory
Copy (Messages)
• CPU cycles to copy data to/from port
• Better for smaller messages
Map (Shmem)
• CPU cycles to map mem to addr space
• CPU cycles to copy data to channel
• Set up once, use many times; or one large-block transfer -> good payoff
Freeing Physical Memory
• When should pages be swapped?
Memory usage above threshold (high watermark)
CPU usage below threshold (low watermark)
• Which pages should be swapped?
History based prediction - Least Recently Used (LRU)
Pages without dirty bit set
Avoid non-swappable pages
Categorize pages into diff types
Second chance variation of LRU - make 2 passes
before deciding which page to swap
Memory Allocation
Memory Allocator
• Determines VA->PA mapping, checks validity/perm
Kernel Level Allocators
• Allocate pages for kernel state, processes states
• Keep track of free memory
User-level allocators
• Dynamic process state (heap) using malloc/free
• Once malloc’d, kernel not involved. Examples: dlmalloc, jemalloc, Hoard, tcmalloc
Homogeneous Architectures
• Each node can do any processing step
+ Keeps front end simple - doesn’t mean each node has all data, but each can get to all data
- How to benefit from caching?
• Scale by adding more nodes
Priority Scheduling
- Tasks have different priority levels
- Run highest priority task next
- Can preempt lower priority tasks
Copy-on-Write
If new process is created, why copy all the memory?
• Instead, map new VA to original page
• Write protect original page
• If only read, this will save memory & time
• On write, page fault and copy/update page tables
• Only copy pages on write => Copy-on-Write
Split Device Driver Model
• Approach: device access ctrl split btwn
front end driver in guest VM (device API)
back-end driver in service VM (or host)
• Modified guest drivers
• Limited to paravirtualized guests
+ Eliminates emul OHs
+ Allow better mgmt of shared devices
Internet Service
• Any service avail through web interface (mail, weather, banking, etc)
3 components:
• Presentation - static content
• Biz logic - dynamic content
• Database tier - data store
Problems with Trap & Emulate
• x86 pre 2005:
4 rings, no root/non-root modes yet
HV = ring 0, guest OS = ring 1
• 17 instructions do not trap, fail silently
• e.g. interrupt-flag manipulation via POPF/PUSHF
• HV doesn’t know there was an error, assumes successful
Strict consistency
• Updates are visible everywhere immediately
• In practice, even on SMP there are no guarantees
• Even harder on DSM
• Doesn’t really exist
More Sync Constructs (All need hardware support)
- Serializers - Good w/priorities, hides need for signalling, condition vars
- Path expressions - uses reg exp to define access
- Barriers - Reverse semaphore
- Rendezvous Points - Waits for multiple threads to reach the rendezvous point
- Optimistic Wait Free Sync (RCU) - Bet on no conflict
What are the overheads associated with scheduling? Do you understand the tradeoffs associated with the frequency of preemption and scheduling/what types of workloads benefit from frequent vs. infrequent intervention of the scheduler (short vs. long timeslices)?
- Longer timeslice keeps CPU util & throughput high, benefits CPU-bound tasks (fewer cxt switches)
- Short timeslice issues I/O ops sooner, better CPU and device utilization, and better user-perceived performance.
Device Drivers
- Driver needed for each type of hardware
- Resp for device access, mgmt, control
- Provided by device manufacturer
- OS standardizes interfaces to drivers
- Device independence & diversity
Summary of Page based DSM
- Each node contribs some mm
- Needs local caches for perf (latency)
- All nodes responsible for part of dist mem
- Home node manages access & tracks page owner
- Explicit replication - possible for load balance, perf, reliability
Performance Considerations
DSM perf metric == access latency
How do we make the most of local memory?
Migration
• Makes sense for SRSW
• Requires data movement
Replication (Caching)
• More general, requires consistency mgmt
• Best for low latency
CPU - Device Interconnect
- Peripheral Component Interconnect (PCI)
- PCIX - PCI Extended
- PCI Express (PCIe) - Faster, low latency, more devs
- SCSI - Connects disks
- Bridge = converter from one bus to another
- Bridge/mem controller routes req to I/O = Mem map I/O
- I/O Port model
- Interrupts: + immediate, - stops current exec
- Polling: + when ok for OS, - delay or CPU overhead
inode example
(Quiz on video #354 & 355)
Memory Virtualization Full
• Full virtualization
• Guest OS expects contig phys mem start @ 0
• Virtual (VM), Physical(Guest thinks phys), Machine (Actual phys mem)
Option 1:
• Guest page table: VA => PA, HypVisor: PA => MA
• Still uses TLB, MMU for PA=>MA, software VA=>PA
Option 2:
• Guest page tables VA=>PA
• Hypervisor shadow PT: VA=>MA
• Hypervisor maintains consistency between them
Hosted Virtualization - Type 2
- Host OS in place of hypervisor
- Host OS has device drivers / VMM module
- Guest VMs run on VMM module
- Native apps run directly on OS
- Example: KVM - based on Linux. Guest VMs run on KVM & QEMU for hardware virt.
- Ex: Fusion, VirtualBox, VMPlayer
Virtual File System
- Abstracts location of file sys - local or remote fs
- Apps still use kernel i.e. POSIX API
- OS abstracts drivers & devices
- OS maintains persistent index of file blocks, size, permissions, etc = inode
- Dentry - directory entry for each path component, kept in OS mem/cache. /users/ada => /, /users, /users/ada
- Superblock abstraction - FS specific info regarding FS layout incl inode, other metadata
Demand Paging
• Virtual page memory not always in physical memory
• Phys page frame saved & restored to/from 2ndary storage
• Pages swapped in/out of mem & swap partition
• Page can be ‘pinned’ - disables swapping
POSIX Shared Memory
- Not as widely supported as SysV
- Uses ‘files’ that exist in tmpfs
- shm_open() - returns a ‘file descriptor’
- mmap() / munmap() - attach / detach
- shm_close()
- shm_unlink()
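Rough usage sketch (the segment name "/myshm" and size are made up):
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

int fd = shm_open("/myshm", O_CREAT | O_RDWR, 0600);   /* 'file' in tmpfs */
ftruncate(fd, 4096);                                    /* size the segment */
void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
/* ... use p ... */
munmap(p, 4096);
close(fd);
shm_unlink("/myshm");                                   /* remove when all done */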
Shared Memory Design Considerations
- One large segment
- Pool of small segments -> use queue of seg ids
- What if data > seg size? Transfer data in rounds (See Project 3!)
Device Types
- Block devices - i.e. disks - r/w blocks of data
- Character - i.e. keyboard - get/put char
- Network devices - btwn block & character - stream of data chunks
- OS standardizes by type
- OS represents devices as special device files
- UNIX like systems - /dev, tmpfs, devfs
Disk Access Optimizations
- Caching / buffering - reduce # of disk accesses
- Buffer cache in main memory
- Read/write from cache
- Periodically flush to disk (use fsync)
- Reduce head movement w/ I/O scheduler
- Prefetching - leverages locality i.e. read 17, also read 18, 19
- Journaling / Logging - reduce rand access. Writes updates in log with blk, offset, value. Periodically writes to proper location.
Inverted Page Tables
- Process ID and page => index to physical mem
* Supplemented with hashing page tables
Memory Virtualization Paravirtualized
- OS knows it’s in virt environ.
- No need for contig physical mem
- Registers pages with HypVisor
- Can batch page table updates to reduce VM exits
- Other optimizations
- OHs reduced or elim on newer platforms
Hardware Virtualization
Key Virt-related hardware features - x86
• AMD Pacifica & Intel Vanderpool technology
• Close 17 bad instructions
• New protection modes - root/non-root = host/guest
• VM Control struct per V(M)CPU, ‘walked by’ hardware
• Extended Page Tables & tagged TLB w/ VM ids
• Multiqueue devices & interrupt routing
• Security & mgmt support
• Add’l instructions for above features
Hardware vs Software DSM
• Hardware - relies on high-end interconnect - $$$
OS thinks it has larger physical memory when memory is really on other nodes
NICs translate remote mm accesses to msgs
NICs involved in all aspects of mm mgmt
• Software
Everything done by software, OS or language runtime
Bare Metal Virtualizers - Hypervisors (VMM) - Type 1
- Bare metal / Hypervisor (VMM) - Privileged VM manages hardware resources & supports VMs
- Xen (Open Source or Citrix) - VMs are called domains. Priv domain, has drivers = dom 0
- ESX (VMWare) - Drivers in VMM, remote APIs
- Microsoft Hypervisor
When do tasks enter the ready queue?
- Creation (fork)
- Interrupt
- I/O completion
- Expiration of time slice
Page Table Entry
Flags
• Valid bit (Present bit) - indicates validity of page
• Dirty bit - set when page written to
• Accessed bit - has page been read or written to
• Protection/Permission bits - R, W, X
• U/S - User mode / Supervisor mode only
File Access Operations in Sprite
(See Sprite paper)
Virtualization requirements
- Present virtual platform interface to VMs
- Provide isolation across VMs
- Protect guest OS from apps
- Protect VMM from guest OS
/dev types
hda, sda, tty, null, zero, ppp, lp, mem, console, autoconf, etc.
Atomic Instructions
- Hardware specific: Test & Set, Read & Increment, Compare & Swap
- Guarantees: Atomicity, mut excl, queue all concurrent instructions but one
Passthrough Model of Device Virtualization
• Approach: VM level driver configs device access permissions directly
• Has exclusive access to device, bypass HV
• Also called VMM bypass model
- Device sharing difficult
- Guest VM must have exact type of driver for device
- VM migration tricky
Why does cloud computing work?
- Law of Large Numbers - per-client resource needs vary, but the average over many clients is roughly constant over time
- Economies of scale
Issues with Mutexes & Condition Vars
• Limitation of mutexes
- error prone, affecting correctness & ease of use
- lack of expressive power
- Require low level hardware support - atomic instructions
DSM Design - Consistency Mgmt
- DSM is similar to Shmem Multiprocessors (SMP)
- In SMP: W-I & W-U, coherence triggered on each write => OH too $$$
- Push invalidations when data written (active/eager/pessimistic)
- Pull modification info periodically (reactive/lazy/optimistic)
- When triggered depends on consistency model
Message Queue System calls
- msgsnd - Send msg to queue
- msgrcv - Receive msg
- msgctl - Perform control op
- msgget - Get msg queue identifier
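Hedged sketch of the call sequence (key path, sizes, and msgbuf layout are illustrative):
#include <sys/msg.h>
struct msgbuf { long mtype; char mtext[64]; };

key_t key = ftok("/tmp/somefile", 'A');        /* both sides derive the same key */
int qid = msgget(key, IPC_CREAT | 0600);

struct msgbuf m = { .mtype = 1 };
msgsnd(qid, &m, sizeof(m.mtext), 0);           /* send (size excludes mtype) */
msgrcv(qid, &m, sizeof(m.mtext), 1, 0);        /* receive next type-1 msg */
msgctl(qid, IPC_RMID, NULL);                   /* destroy the queue */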
What is consistency model?
• Agreement between mem (state) and upper software layers
• Mem guarantees to behave correctly:
Access ordering
Propagation / Visibility of updates
What is Virtualization?
- Virtualization allows concurrent execution of multiple OSs (and their apps) on the same physical machine.
- Virtual resources - each OS thinks that it “owns” hardware resources
- VM - OS + apps + virtual resources
- Virt layer - mgmt of phys hardware (hypervisor, VM monitor) incl isolation
Cloud Computing & Animoto
- Poster child of cloud computing
- Amazon added hardware for xmas, idle rest of yr
- Opened access to resources via web based API (2006) - Birth of AWS and EC2
- Went viral - 750k new users - 50 instances -> 3400 in one week
Device Access - Programmed I/O (PIO)
- Control via command registers
- Data r/w via registers
- e.g. for a 1500B packet w/ 8-byte regs or bus = 1 command register write + 188 data writes
Processor Virtualization
• Guest instructions exec directly by hardware
• For non-priv ops, hardware speed = efficiency
• Priv-ops are trapped to hypervisor:
Illegal: terminate VM
legal: emulate behavior VM OS was expecting
Sprite - Analysis to Design
- Cache with write-back - every 30 sec, write back blocks not modified within the last 30 sec
- Client wants to open file being currently written - get dirty blocks
- Open goes to server, dirs not cached
- On concurrent write -> disable caching
- Sequential write sharing - caching & sequential semantics
OS Bypass
- Not necessary to go through kernel to get to device
- OS bypass - user accesses device directly
- Device regs/data directly accessible
- Needs user level driver
- OS retains coarse grain control
- Device needs user-level driver/library
- Demux capabilities to send results to correct process
Linux O(1)
- Constant time to select/add task, regardless of task count
- Pre-emptive, priority based
- Diff. real-time tasks vs other
- Different timeslice values for diff priorities
- Feedback - uses history to boost or lower priority
- Runqueue = 2 arrays of tasks - active & expired
- Replaced in Linux 2.6.23 by CFS scheduler
Determine CPU bound vs Memory bound
- Use heuristic based on history
- Sleep time won’t work - thread isn’t sleeping while waiting on mem, and computing it in software takes too long
- Need hardware support - hardware counters - cache use, power use, etc
- Linux perf tool, oprofile used to access counters
Device Access - Direct Memory Access (DMA)
- Uses DMA controller
- CPU programs device via cmd registers and via DMA controls
- DMA controller config with in-memory addr and size of packet buffer
- e.g. for a 1500B packet: 1 command write + 1 DMA configure, but the DMA config step is more complex
- Memory must be present in physical mem while write is happening - memory must be pinned
Run to Completion Scheduling
First-Come First-Serve (FCFS)
- Tasks in order of arrival | runqueue == FIFO queue
if T1 = 1s | T2 = 10s | T3 = 1s:
Throughput - 3/12s = 0.25 tasks/s
Avg job completion time (1+11+12) / 3 = 8 sec
Avg job waiting time - (0+1+11) / 3 = 4 sec
Cloud Service Models
- On premise - you manage all
- Infrastructure as a Service (IaaS) - Amazon EC2 - You manage OS -> Apps
- Platform as a service (PaaS) - Google App Engine - You manage app & data only
- Software as a Service (SaaS)- i.e. Gmail - you manage nothing
VFS on Disk
- VFS data structs are software entities maintained by OS FS component
- file => data blocks on disk
- inode => track files’ blocks - on disk in a block!
- Superblock has map of all blocks on device - which are inode, data, free, etc
Why has virtualization become popular only recently?
- Mainframes were not ubiquitous
- Other hardware was cheap
- Then, realized servers were underutilized
- Datacenters getting too large
- Needed more & more sys admins
- Escalating power bills
Cloud Computing Requirements
• Determine capacity on expected (peak) demand
• When demand exceeds capacity
- dropped requests => lost opportunity
• Ideal
Fine grained pricing | professionally mngd | API Acc
+ Scales elastically with demand, instantly & automatically
- You don’t own the resources
Overhead of address translation
- Single-level PT: 1 for PT, 1 for mem = 2 accesses
- Four level PT: 4 for PTs, 1 for mem = 5 accesses
- Page Table Cache - Trans. Lookaside Buffer (TLB)
- TLB has all flag bits
- small # of cached addr => high TLB hit rate due to temporal & spatial locality
Typical Device Access
- Process needs to access disk or NIC
- Process does system call
- OS runs in-kernel stack to form packet
- OS invokes appropriate driver
- Device driver configures the device for the request
- Device performs op
- Results traverse this call chain in reverse
Cache Coherence - NCC / CC / WU / WI
- Non-cache coherent (NCC) - writing to one cache does not update other caches w/same memory addr
- Cache Coherent (CC) - Hardware updates all caches
- Write-invalidate (WI) - If one cache is changed, all others are invalidated | + lower bandwidth used
- Write-update (WU) - Hardware updates all caches w/addr | + Data avail right away
- WI/WU determined by hardware
Weak consistency
• Sync points - ops avail (r/w/sync) to upper levels
• All updates prior to a sync point will be visible
• No guar what happens in between
• Sync must be called by both writer & reader
• Variations
Single sync operation
Sep sync per subset of state (page)
Sep “entry/acquire” vs “exit/release” ops
+ Limit data movement & coherence ops
- maintain extra state for addl ops
Shortest Job First (SJF) + Preemption
- Use heuristic based on history to estimate time
* Windowed average - how long did the task run the last n times?
Sync vs Async
- What happens to calling thread after I/O call?
- Depends if operation is sync or async
- Sync - process blocks and waits for I/O to complete
- Async - process continues, then checks or is notified by device or OS that oper is done (poll vs interrupt)
What types of app access algos for DSM?
- Single reader / single writer (SRSW)
- Mult readers / single writer (MRSW)
- Mult readers / mult writers (MRMW)
Threads and Simultaneous Multithreading (SMT)
- Co-sked CPU-bound tds -> degrades perf 2x, mem idle
- Co-sked Mem-bound tds -> Idle CPU cycles
- Mixing CPU/Mem tds -> BINGO! Avoid contention for processor pipeline, CPU/Mem well utilized
Page Table
- Maps pages in virtual mem to page frames in DRAM
- Pages and page frames are same size
- Virt Addr split into Virt Page No (VPN) & Offset
- Physical Frame Num (PFN) indexes phys page
- Unused pages are reclaimed
- Each process has its own page table pointed to by CPU register (x86 it’s called CR3)
Heterogeneous Architectures
• Functionally heterogeneous
Diff nodes, diff tasks/requests
Data doesn’t have to be uniformly accessible everywhere
+ Benefit of locality & caching
- Front end more complex
- More complex mgmt
• Scale by knowing demand and adding nodes accordingly
Semaphores
- Common sync construct
- Acts like a traffic light - stop or go
- Similar to mutex but more general
- Have an init value - positive integer = max count
- On Try (wait): If non-0, decrement & proceed
- On Exit (post): Increment
- Designed by Dijkstra (Dutch): Proberen (P) - Wait, Verhogen (V) - Post
Stateless File Server
• Stateless - keeps no state
- Can’t support caching & consistency mgmt
- Every request self contained -> more bits xfered
+ No resources used on server side (CPU/mm)
+ On failure, simply restart
IPC Command Line Tools
• ipcs - List all IPC facilities
-m display shmem info only
• ipcrm - Delete IPC facility
- [shmid] delete specific shmid
Stateful File Server
• Stateful - keeps client state - needed for practical model to track what is cached
+ Supports locking, caching, inc ops
- On failure, needs checkpointing & recovery mech
- Overheads to maintain state & consistency - this depends on caching mech & consistency protocol
Shortest Job First (SJF) Scheduling
Schedule tasks in order of exec time
runqueue == ordered queue or tree
if T1 = 1s | T3 = 1s | T2 = 10s:
Throughput - 3/12s = 0.25 tasks/s
Avg job completion time (1+2+12) / 3 = 5 sec
Avg job waiting time - (0+1+2) / 3 = 1 sec
Why do we care about virtualization
Page Fault
If MMU detects physical memory access cannot be performed, it causes a Page Fault.
• Pushes error code on kernel stack; faulting address stored in register CR2 (x86)
• Traps into kernel
• Page fault handler in OS determines action - get page from disk, permission violation, etc
Binary Translation
- Main idea - rewrite VM binary to never issue 17 instructions that don’t trap
- Mendel Rosenblum, founder of VMWare
- Goal is to run unmodified OSs = full virtualization
- Approach: dynamic binary translation
- Inspect code blocks for 17 bad instructions
- If found, translate to alt sequence - adds OH
Multi-Level Page Tables
• Flat page tables no longer used
• Now hierarchical, multi-level structures
• Outer page table & inner page table
• More page levels can be added
Important w/64-bit arch. - memory is more sparse
+ more granular, - more memory access time
Page Size
• Determined by offset size
• 10-bit = 2^10 = 1024 = 1k page size
• 12-bit = 2^12 = 4096 = 4k page size
• Linux default = 4k, can be 2MB (large), 1 GB (huge)
• Solaris/SPARC - 8kB, 4 MB, 2 GB
+ Lg’r page size => small’r pg table => more TLB hits
- internal fragmentation => wasted memory
Priority Inversion
A higher priority thread may need a mutex held by a lower priority thread, so the high priority thread is stuck waiting (the priorities are effectively inverted). Remedy: temporarily boost the priority of the mutex owner so it runs and releases the lock, then drop it back to its original priority.
Linux Completely Fair Scheduler (CFS)
- Better fairness & interactive task performance
- Runqueue == Red-Black Tree (self balancing)
- Ordered by virtual runtime (time spent on cpu)
- Pick leftmost node (node w/ least time on cpu)
- Periodically adjust virtual runtime
- rate faster for low priority / slower for high priority
- Select task: O(1) | Add task: O(log N)
Defining Virtualization
• A VM is an efficient, isolated, duplicate of real machine
• Supported by VMM:
1. Provide environ essentially same as orig machine (Fidelity)
2. Programs show minor decrease in speed
3. VMM is in complete control of system resources
• Goals: fidelity, performance, safety, isolation
Hardware Protection Levels
- on x86, 4 protection rings - Highest priv - Ring 0 (OS level), Lowest - Ring 3 (Application level)
- Can be used to put hypervisor in Ring 0 (root), guest OS in Ring 1 / apps stay in Ring 3 (non-root)
- Guests that try illegal ops cause a VMexit; once handled by the hypervisor, VMentry returns control to the guest
Direct control of device - ioctl()
Example - determine the size of a block device:
unsigned long numblocks = 0;
int fd = open(argv[1], O_RDONLY);      /* e.g. /dev/sda */
ioctl(fd, BLKGETSIZE, &numblocks);     /* device-specific control op */
close(fd);
Cloud Deployment Models
- Public - 3rd party clients/tenants
- Private - leverage tech internally
- Hybrid - Public/Private - failover, dealing with spikes, redundancy, testing
- Community - Public cloud used by certain type of users
Virtualization Advances
- 1st vector - CPU modifications = VTX,VTX2, VTX3
- 2nd vector - Chipset Focus = VTD
- 3rd vector - IO Device Focus - DMA/Multiqueue = IOV, VMDq
System V Shared Memory
- Segments not necessarily contiguous
- Shmem is system-wide - limits on # and size
- Uses unique key created by ftok
- Steps: Create, attach, detach, destroy (or reboot)
Examples of Big Data Stacks
- Hadoop
- Berkeley Data Analytics Stack (BDAS)
- HPCC Stack (LexisNexis)
Spinlock ‘Delay’ Alternatives
• Delay after lock is released
+ Contention improved | + Latency - ok
- Delay - much worse
• Can also delay in each loop. Works on NCC arch but increases delay even more
• Fixed delay - i.e. CPU ID
• Dynamic Delay (backoff based) - rand delay in range based on perceived contention in system
• Perceived contention = count of failed test_and_sets
Spinlock Performance Comparisons
(from Anderson paper)
• Queue has huge overhead with many procs - worst, best < 7
• Spin-on-read is worse < ~5, best > ~7
• Static delay, dynamic delay, static ref - similar performance
• Delays - static better than dynamic
• Delay after every mem ref is better than delay after lock is freed
NO SINGLE GOOD ANSWER!
Queueing Lock (Anderson)
• Uses array of flags with n elements (n = # of cpus)
• Two pointers - current lock holder & last position in queue
- Needs atomic operation
- O(N) size
- Latency - more costly r&inc | + Delay | + Contention
(See videos #315 & #316)
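A simplified sketch of the idea (constants assumed; real versions pad each flag to its own cache line):
#define NCPUS 8
enum { MUST_WAIT = 0, HAS_LOCK = 1 };
volatile int flags[NCPUS] = { HAS_LOCK };   /* slot 0 starts with the lock */
int queue_last = 0;                         /* ticket counter, bumped atomically */

int lock(void) {
    int my = __sync_fetch_and_add(&queue_last, 1) % NCPUS;  /* atomic read & increment */
    while (flags[my] == MUST_WAIT) ;                        /* spin only on own slot */
    flags[my] = MUST_WAIT;                                  /* reset slot for later reuse */
    return my;
}

void unlock(int my) {
    flags[(my + 1) % NCPUS] = HAS_LOCK;                     /* hand lock to next ticket */
}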
Device Virtualizations
- For CPUs and mem, less diversity => standardization of interface (ISA level)
- For devices, more diversity, lack of standards
- 3 key models to virtualize devices: Passthrough Model, HypVisor Direct Model, Split Device Driver Model
Implementing DSM
- DSM must intercept every access to DSM state
- Need to decide if local or remote
- Need to trigger coherence msgs
- OHs should be avoided for local, non-sh state (pgs)
- Dynamically ‘engage’ and ‘disengage’ DSM as needed
- Use MMU support - traps if invalid/not allowed
- Trap will pass to DSM layer to send msg
- Cached content write will trap and pass to DSM for coherence ops
- Other MMU info useful (dirty, etc)
Where can files be stored in DFS with single server and many clients?
- Client memory
- Client storage device (HD/ SSD)
- Buffer cache in server memory - usefulness depends on client loads, request interleaving, etc.
Buddy Allocator
• On memory request, subdivides memory into 2^x chunks and finds the smallest 2^x chunk that can satisfy request.
• On free, check buddy to see if you can aggregate into a larger chunk. Aggregation works well/is fast.
(Video #250)
Sprite DFS Access Patterns
• 33% of accesses are writes, 75% of files open < 0.5 sec, 90% of files open < 10 sec, 20-30% new data erased within 30 sec, 50% within 5 min, file share rare
=> caching ok but Write-Thru not sufficient, OH too high w/session-sem
• Write-back on close not necessary
• Must support concurrent access but not optimize
Peer Distributed Apps
- Big data apps, web search, content sharing, DSM
- State is distrib & stored in all nodes
- Each node owns part of the state, provides service
- All nodes are peers, but may be a mgr node(s)
Monitors
• Higher level sync construct -MESA, Java
• Specifies shared resource
• Entry procedure
• Possible condition vars
• On entry, locks & checks are done
• On exit, unlock, check, signal
Also, a programming style using threads & concurrency
DFS Models
- Client/server
- File server distributed on multiple machines - mirrored (has limit) or partitioned (scales!)
- Files stored on / served from all machines - Blurs lines between clients & servers
File vs Directory Services
- Knowing access patterns
- e.g. sharing freq, write freq, importance of a consistent view
- Two types of files - directories & reg files
- Could use session-sem for files, UNIX for dirs
- Or less freq write-back for files than dirs
Shared Memory IPC
• Physical memory mapped into VA space of multiple processes
• So different VAs map to same PA
• VAs are not the same
• Phys mem doesn’t need to be contiguous
+ Sys calls only for set-up, data copy reduced
- Explicit sync, comm protocol, shmem mgmt
Scheduling on multi-CPU systems
- Cache affinity is important - keep task on same CPU as much as possible
- Hierarchical scheduling arch. = load balancer divides work between CPUs, per-CPU scheduler for each
- Multiple memory nodes - closer = faster
- NUMA - Non Uniform Mem Access platform
- QPI - Quick Path Interconnect
What is the name of the data structure that the scheduler uses as its ready queue?
- Runqueue - tightly coupled to scheduling algo
* OS can use multiple runqueues with different timeslices. See Multi-Level Feedback Queue (MLFQ)
Replication vs Partitioning
• Replication - Each machine holds all files
+ can load balance => more avail & fault tolerant
- writes become more complex - can write to all, or write to one and propagate to others. Replicas need an algo to resolve collisions, e.g. voting
• Partitioning - Each machine has subset of files
+ Better availability, scalability, single file writes are simpler
- In failure, some data can be lost. Harder to load balance => hot spots
• Combine techniques for best results
Virtualization Technologies
- VirtualBox is virtualization
* Java VM & Virtual GameBoy - not
DSM Architecture
• Page-based DSM architecture
  - Distributed nodes, each contributes local mm to a pool of pages from all nodes
  - Each page has id (‘home’ node), frame #
  - If MRMW, need local caches for performance
  - Home (mgr) node drives coherence ops
  - All nodes responsible for part of dist mem mgmt
• Home node - keeps state: pages accessed, modifications, caching enabled, locks
• Current owner (may not be home node)
• Exact replicas - for load balance, perf, reliability
POSIX Semaphore API
sem_t sem;
sem_init(sem_t *sem, int pshared, int count);
sem_wait(sem_t *sem);
sem_post(sem_t *sem);
Equiv to mutex: sem_init(&sem, 0, 1);
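Minimal usage sketch as a binary semaphore between threads (pshared = 0 means threads of one process):
#include <semaphore.h>

sem_t sem;
sem_init(&sem, 0, 1);    /* initial count 1 => behaves like a mutex */
sem_wait(&sem);          /* P / proberen: decrement, block if already 0 */
/* critical section */
sem_post(&sem);          /* V / verhogen: increment, wake a waiter */
sem_destroy(&sem);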
Indexing DSM - Where is particular page?
DSM metadata - each page has:
• address = node ID + page frame #
• node ID = home node
Global map (replicated on all nodes)
• object (page) id => manager node id
• manager map avail on each node
Metadata for local pages
• Per-page metadata is distributed across managers
Causal consistency
- Guar they will detect causal relations between updates
* Concurrent writes - no guarantees
Basic I/O device features
• Have control registers to allow CPU/device comm
Command regs - controls device
Data regs - control data xfer
Status regs - info about device status
• Microcontroller - device ‘cpu’ | Device memory
• Other - i.e. A/D converters, etc
System V Shared Memory API
- shmget(key, size, flags) - create or open using key generated by ftok(pathname, proj_id); IPC_CREAT flag used to create
- shmat(shmid, addr, flags) - attach; addr can be NULL, cast returned addr to type needed
- shmdt(addr) - detach (takes the attached address)
- shmctl(shmid, cmd, buf) - control ops; destroy w/ IPC_RMID
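Illustrative create/attach/detach/destroy sequence (pathname, proj id, and size are made up):
#include <sys/ipc.h>
#include <sys/shm.h>

key_t key = ftok("/tmp/app", 42);                  /* same key in every process */
int shmid = shmget(key, 4096, IPC_CREAT | 0600);   /* create or open segment */
char *addr = (char *)shmat(shmid, NULL, 0);        /* attach; cast to needed type */
/* ... read/write through addr ... */
shmdt(addr);                                       /* detach (takes the address) */
shmctl(shmid, IPC_RMID, NULL);                     /* destroy */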
Distributed Shared Memory
- Each node owns state
- Provides services - mem r/w from any node, consistency protocols
- Permits scaling beyond limits of a single machine
- More shared mem at lower cost
- Slower overall mem access
- Commodity interconnect tech supports RDMA, providing low latency access to remote memory
How many datacenters in the world in 2015? How many sq feet?
2011: 510k datacenters - 285.5M sq feet
2015: Est 1 - 1.5B datacenters!
File Sharing Semantics in DFS
- Single node (machine) environment - If a process writes a file, it is avail to other processes immediately
- Multi client - one client’s write may reach the server, but other clients may not see the update => must use more relaxed semantics
- UNIX semantics - every write avail immediately
- Session semantics - write-back on close(), update on open(). Not great for concurrent sharing or when files stay open for a long time.
- Lease on cached data (NOT a locking mech) => puts bounds on inconsistency
- Augment with flush() / sync() API
- Can use immutable files - files are never modified in place
- Transactional guar - all changes atomic
Network File System (NFS)
- From Sun
- System calls Virtual File System (VFS)
- Divides local / network requests
- Client talks to server via RPC, which talks to server VFS as if request came from server itself
- File handle passed back is valid while file exists (otherwise ‘stale’)
Requirements of Cloud
- Fungible resources
- Elastic, dynamic resource allocation
- Scale: Mgmt at scale, scalable resource alloc
- Dealing with failures
- Multi-tenancy - performance & isolation
- Security
DSM - Sharing Granularity
X Cache line granularity - OH too high for DSM
X Variable granularity - too fine
✓ Page granularity (OS level)
✓ Object granularity (language runtime)
• Beware of false sharing!
Spinlocks
• Basic sync construct
• Similar to mutex:
  - mutual exclusion
  - lock/unlock (free)
• Differs from mutex:
  - When lock busy, execution continues spinning, not blocked
  - Burns CPU cycles checking for availability
• NEEDS HARDWARE-SUPPORTED ATOMIC INSTRUCTIONS
What is a common tasks that is implemented in software in hybrid (hard+software) DSM implementations?
• Prefetch pages
Addr translation & triggering invalidations done in hardware!
Test and Test & Set Spinlock
while ((lock == busy) OR (test_and_set(lock) == busy))
• In this case, lock == busy checks the cached value, so no contention; the test_and_set part only executes once the lock looks free
• Also called ‘Spin on Read’ or ‘Spin on Cache’
+ Latency & Delay - ok
- Contention - better than t&s, but CC WI - BAD, CC WU - OK, NCC - No diff
• Complexity with WU - O(N) | WI - O(N^2)
Sprite Distributed File System
- Research DFS
- Paper: Caching in the Sprite Network File System
- Used trace data on file use/access patterns
Test & Set
- Atomically returns (test) orig val and sets new value (busy)
- First thread: test_and_set(lock) = 0 => free
- Next tds: test_and_set(lock) = 1 => busy. Sets lock to 1 which is ok since it’s already 1
Page Table Size
32-bit architecture
• Page Table Entry (PTE) - 4 bytes PFN + flags
• VPN = 2^32 / Page Size
• Page size typically 4kb
• (2^32 / 2^12) x 4 bytes = 4MB per process!
• For 64-bit w/8 byte PTE = 32PB per process!!
• This assumes entry for each VP, needed or not
Cloud Enabling Technologies
- Virtualization
- Resource provisioning (scheduling) - Yarn, Mesos
- Big Data processing (hadoop, Mapreduce, Spark)
- Storage: DFS, NoSQL, distributed in-mem cache
- Software-defined resource slices
- Monitoring - real time processing (Flume, CloudWatch, Log Insight)
Remote File System - Compromise
• Allow clients to store some of file locally (blocks)
+ lower latency for cache access
+ Server load reduced => more scalable
• Force clients to interact with server (frequently)
+ Server has insight into client activity
+ Server can control access
- Server more complex, requires diff file sharing semantics
Hypervisor Direct Model of Device Virtualization
• Approach: VMM intercepts all device calls
• Emulate device operation:
  - translate to generic I/O op
  - traverse VMM-resident I/O stack
  - invoke VMM-resident I/O driver
+ VM decoupled from phys device
+ Sharing, migration simpler
- Adds latency to device access, complexities in HV drivers
Sequential consistency
- Similar to SMP with threads using no locks
- Memory updates from diff processors may be arbitrarily interleaved
- All processes will see exact same interleaving
- Ops from same process will appear in order they were issued, not interleaved
Pseudo Devices
- Accept and discard all output /dev/null
* Produce var len of pseudo rand nums - /dev/random
Cloud as Big Data Engine
- In 2011, 1.8 zettabytes of data created
- All in US tweet 4320x/day
- 200 billion feature films - takes 47 million years to watch
- Data storage layer | Data processing layer | Caching layer | Language front ends | Analytics libs | Continuously streaming data
Shared Memory & Sync
- Since memory can be concurrently accessed, must be sync’ed
- Can use pthreads mutexes & cond vars
- OS supported IPC sync functions
- Must coordinate concur access and signal when data is avail
Round Robin Scheduling
- Pick up first task from queue (like FCFS)
- If a task blocks, it yields to the next task and goes to the back of the line.
- Can be used with priorities using preemption
- RR w/interleaving == timeslicing - each thread gets a set amount of time on CPU - Comparable metrics to SJF
CPU Bound vs I/O Bound Timeslicing
CPU Bound
• Use longer TS => Better throughput/avg. comp time, worse wait time
• Limits cxt swtch overhead
I/O Bound
• Use smaller TS => CPU & device util higher, better perceived performance
Cache Coherence and Atomics
• Atomics always issued to memory controller
+ No race conditions will occur
- Atomics will be slower due to bus/IC contention
- Generates coherence traffic to update caches - changes or not - also makes op slower
Paravirtualization
• Goal - give up on unmodified guests
• Modify guest so it knows it’s running in virt environ
• Makes explicit HV calls = hypercalls
• Hypercalls like system calls:
Package context info, specify desired hypercall, trap to VMM
• i.e. XenSource - Open Source HV, bought by Citrix
Segmentation
- Segments have arbitrary granularity
- Segments correspond to different memory parts e.g. code, heap, stack, data, etc
- Addr = Segment descriptor + offset
- Segment size set by base + limit registers; base + offset gives the linear addr
- Typically used with paging unit to get phys addr
Hyperthreading
- Multiple hardware supported execution contexts
- Multiple sets of registers for separate threads
- Also called hardware/chip/simultaneous multithreading
- Still only one CPU
- Up to 8 sets of registers
- Can be enabled/disabled at startup
- Can hide memory access latency
Timeslice Scheduling
Maximum amount of uninterrupted time given to a task. Also called time quantum. Shares the CPU
+ Short tasks finish sooner => More responsive
+ lengthy I/O ops initiated sooner
- Overheads - interrupt, schedule, cxt switch. Keep timeslice > overhead
NFS Versions
- Since 80’s
- NFSv3 - stateless, NFSv4 - Stateful
- Caching is session based
- Also supports periodic updates - default: 3 sec files, 30 sec directories
- Locking - lease-based mech. Limited time for lock.
- NFSv4 has r/w lock - ‘share reservation’
Formula for CPU utilization
• [cpu_running_time / (cpu_running_time + context_switching_overheads)] x 100
• The cpu_run_time and cxt_swtch_ovrhds should be calc’ed over a consistent, recurring interval
Helpful Steps to Solve:
• Determine a consistent, recurring interval
• In the interval, each task should be given an opportunity to run
• During that interval, how much time is spent computing? This is the cpu_running_time
• During that interval, how much time is spent cxt switch? This is the cxt_switch_overheads
• Calculate!
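Hypothetical worked example (all numbers assumed): 2 CPU-bound tasks, 10 ms timeslice, 0.1 ms per context switch. One recurring interval = 2 timeslices: cpu_running_time = 20 ms, cxt_switch_overheads = 2 x 0.1 = 0.2 ms, so utilization = 20 / 20.2 x 100 ≈ 99%. Shrinking the timeslice to 1 ms gives 2 / 2.2 x 100 ≈ 91%.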
Cycles Per Instruction (CPI) counter (Fedorova experiment)
- Memory bound has high CPI
- CPU bound has CPI of approx 1
- In Fedorova's experiment, mixed types worked best
- In the real world, most apps have similar CPI, so this doesn’t work
- Ergo, schedulers should be aware of resource contention as well as load balancing.
- Subsequent work shows Last Level Cache (LLC) is a hardware counter that works well
Spinlock Performance Metrics
- Reduce latency - Time to get free lock - ideally immed.
- Reduce wait time (delay) - Time to stop spin & get lock - ideally immed.
- Reduce contention - bus/IC traffic - ideally zero
- First two metrics conflict with 3rd
IPC Overview
Processes share memory
  - Data in shared memory
Processes exchange messages
  - Message passing via sockets
Requires sync
  - Mutexes, waiting
Test & Set Spinlock
(See code in video #309)
+ Latency is minimal - just atomic
+ Delay minimal - spinning on atomic
- Contention - all spinning processors go to mem on each spin
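A bare-bones sketch using a GCC atomic builtin (illustrative, not the course's code):
volatile int lock = 0;   /* 0 = free, 1 = busy */

void spin_lock(volatile int *l) {
    while (__sync_lock_test_and_set(l, 1) == 1) ;   /* atomically set busy; spin while old value was busy */
}

void spin_unlock(volatile int *l) {
    __sync_lock_release(l);                         /* reset to 0 / free */
}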
Cloud computing vision
- John McCarthy predicted in 1961 that computing would be a utility like electricity or phone
- Computing == fungible utility
- Limitations - API lock in, hardware dependencies, latency, privacy, security
Cloud Computing Overview
- Shared resources - infrastructure & software/services
- APIs for access & config - web based, libs, CLI
- Billing/accounting - spot, reservation, entire mkt
- Typically discrete quantities: tiny, S, M, L, XL
- Managed by cloud provider
- e.g. OpenStack (open source), VMWare stack
IPC Message Passing
• OS creates & maintains a channel - buffer, queue
• OS provides interface to process - i.e. a port
• Processes send/write msgs to a port
• Processes rcv/read msgs from a port
• Kernel required to establish & perform
- overheads
+ simplicity - kernel does mgmt/sync
Remote File Service - extremes
• Extreme 1 - Upload/download - gets entire file, operates on it, uploads it i.e. ftp, git
+ local r/w at client
- entire file up/down even for small access
- no server control
• Extreme 2 - True Remote File Access - every access done on remote file
+ file accessed centralized, easy consistency
- slow, high network costs, limited scalability
Hardware support for memory
- MMU translates virtual address -> physical addr & reports faults
- Registers hold pointers to page table, base & limit size, no. of segments
- MMU Cache - Translation Lookaside Buffer (TLB)
IPC Message Types
- Pipe - connects processes w/byte stream
- Msg queues - carry msgs among processes, incl priorities, sked. APIs SysV & POSIX
- Sockets - Uses ports, kernel level socket buffer, kernel processing - tcp/ip etc
Multi-Level Feedback Queue (MLFQ)
- OS can use multiple runqueues with different timeslices.
- Tasks move from shorter to longer timeslice queues as they need more CPU time, and vice-versa
- Fernando Corbato won Turing Award for MLFQ
Caching State in DFS
- Client can locally maintain portion of state
- Client can perform ops on cache state (open/r/w/cl)
- Required coherence mechanisms - WU / WI get triggered on write to file, on demand, periodically, or on open
- Client or server driven
Block device stack
- File = logical storage unit mapped to phys storage
- Kernel file system has info how to find/access file
- OS specifies interface - can replace FS
- Linux uses POSIX API
- Generic block layer - standard for all block devices
- Device driver uses protocol specific APIs
inodes with indirect pointers
• Solves limit on file size with inodes
• inode has metadata, blk pointers as before (1k blks)
• But also now has ‘indirect pointers’ to a block(s) of pointers - 256k per entry
• Double indirect pointer - block of block of pointers - 64MB
• Triple indirect pointer
+ Allows small inode, access large files
- Slower, more disk accesses
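Where those sizes come from (assuming 1KB blocks and 4-byte block pointers, consistent with the numbers above): one indirect block holds 1KB / 4B = 256 pointers => 256 x 1KB = 256KB of data; double indirect => 256 x 256 x 1KB = 64MB; triple indirect => 256^3 x 1KB = 16GB.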
Virtual vs Physical Memory
- Virtual address space is much larger than physical
- OS must allocate physical mem - swaps memory to/from disk.
- VM has pages, Physical has page frames of same size
- OS must arbitrate memory use - translate/verify virtual addr -> phys. Addr. Uses page tables or segments
ext2: Second Extended Filesystem v2
• In Linux, now ext4
• For each block group => # inodes, # disk blocks, start of free blocks
• Group descriptor - bitmaps, # free inodes, # dirs
• Bitmaps - tracks free blocks and inodes
• inodes - # from 1 to max, 1 per file 128 bytes
• Data Blocks - file data
(remember, directories are just single files that have their own dentry!)
Reader / Writer Locks
• Read can have shared access, Write needs exclusive
• Specify type of access for lock
In the Linux kernel: rwlock_t m;
read_lock(m); // critical sec; read_unlock(m);
write_lock(m); // critical sec; write_unlock(m);
Difference in diff OS’s: read_unlock effect, upgrade/downgrade priorities, interaction with sked policy
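User-space analogue with pthreads (a sketch; the card's rwlock_t is the in-kernel API):
#include <pthread.h>

pthread_rwlock_t m = PTHREAD_RWLOCK_INITIALIZER;

pthread_rwlock_rdlock(&m);   /* shared: many readers may hold it at once */
/* read-only critical section */
pthread_rwlock_unlock(&m);

pthread_rwlock_wrlock(&m);   /* exclusive: one writer, no readers */
/* write critical section */
pthread_rwlock_unlock(&m);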
Checkpointing
Failure & recovery management technique
• Periodically save process state
• When failure, restart from c.p. => faster
• Simple approach - pause & copy
• Better - Only copy pages that have been modified
but to rebuild, multiple diffs may be needed
• Other uses - debugging, migration
Slab Allocator
- Builds custom object caches of common sizes
- Caches point to phys contiguous memory
- Slabs are contiguous memory pages
- Avoids internal fragmentation
What file sharing semantics are supported by NFS and its cache consistency mechs?
- Not Unix
- Maybe a little session based dep on config
- Maybe a little periodic based dep on config
- Not immutable
So really none of the above!
Distributed File Systems
- Accessed by higher level well defined interface - access via VFS
- Focus on consistent state
- Mixed distribution models - replicated or partitioned, peer-like system
- Can be as simple as local disk and remote disk
Shared Memory Multiprocessors
- Bus-based - only one request in flight at a time
- Interconnect (I/C) based - multiple requests can be in flight at once
- Also called Symmetric multiprocessors (SMPs)
- Caches used to hide mem latency - more in shmem due to contention
- No-write - CPU writes directly to mem, cache entry invalidated
- Write-thru - both cache and mem written to
- Write-back - write to cache, mem updated later
Cloud Failure Probability
Cloud has N = 10 CPUs. Prob of a single CPU failing = 0.03
• Prob of at least one failure = 1 - (0.97^10) ≈ 26%
n = 100: 1 - (0.97^100) ≈ 95%
Metrics for Scheduling Algos
Throughput
Avg job completion time
Avg job waiting time
CPU Utilization
IPC
OS-supported mechanisms for interaction among processes (coord & communication)
• Message passing - Sockets, pipes, message queues
• Memory-based IPC = Shmem, mem mapped files
• Higher level semantics - RPC, Files
• Sync primitives - Semaphores, mutexes
How does scheduling work? What are the basic steps and datastructures involved in scheduling a thread on the CPU?
• A CPU scheduler decides how & when tasks (thread or process) access a shared CPU.
• The runqueue is the data structure used
• Steps:
Context switch to selected task
Enter user mode
Set PC to next instruction in scheduled task
Other IPC Sync
• Message queues
Example: P1 sends ‘ready’, P2 rcvs & sends ‘OK’
• Semaphores
- OS supported comm construct
- A binary semaphore (value 0 or 1) can be used like a mutex
inodes
• index of all disk blocks corresponding to a file
• Uniquely numbered starting at 1
• Has list of all blocks in a file
• Other metadata - permissions, status, etc
+ Easy to perform seq or rand access to file
- Limit on file size
OS Scheduler - Task Assignment (Processes or Threads)
• Assign tasks immediately - simple scheduling (FCFS)
• Assign simple tasks first - maximize throughput (SJF)
• Assign complex tasks first - max. util of CPU, devices, memory
Internet Services Architecture
• For scale, multi-process, multi-node -> scale-out architecture
1. Boss-worker: front-end distributes requests to nodes
2. All equal: all nodes can handle all steps (func homogeneous)
3. Specialized nodes: nodes only handle certain steps
(functionally heterogeneous)