Papers for Scheduling, Memory Management, IPC, and Sync Flashcards
What is the premise of Fedorova's paper on scheduling for chip multiprocessors?
That scheduling can have a large impact on CPU performance and a bad scheduling decision can sometimes even cancel out the benefits of smarter chip designs
Where does Fedorova consider the bottleneck to be for CPU performance?
contention for the processor pipeline, because that is the resource the hardware threads sharing a multithreaded processor compete for
What types of threads should be scheduled on the same CPU according to Fedorova?
Threads that won’t compete for the same resources as each other.
What is Instruction delay latency?
the delay between when an instruction starts and when it completes (i.e. how long the thread is held up by an instruction). Can be used to approximate processor pipeline performance
How do we approximate the instruction delay latency?
Use the mean cycles-per-instruction (CPI), which the hardware already has counters for. Low CPI == high pipeline usage (the thread issues instructions constantly); high CPI == low pipeline usage, because a high CPI means the thread is blocked a lot
What is the outcome of Fedorova's paper?
That we should schedule threads with high and low CPIs together on the same processor, so their pipeline demands complement each other
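A minimal sketch of what that pairing could look like (a hypothetical illustration, not code from the paper): given already-measured per-thread CPI values, sort by CPI and co-schedule the lowest-CPI thread with the highest-CPI thread on each core so pipeline demand stays balanced.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical illustration: pair the highest-CPI (mostly blocked) thread
 * with the lowest-CPI (pipeline-hungry) thread on each core, so no single
 * core ends up with only pipeline-heavy threads. */
typedef struct {
    int tid;
    double cpi;   /* measured cycles-per-instruction for this thread */
} thread_info;

static int by_cpi(const void *a, const void *b) {
    double d = ((const thread_info *)a)->cpi - ((const thread_info *)b)->cpi;
    return (d > 0) - (d < 0);
}

int main(void) {
    thread_info threads[] = {
        {0, 1.2}, {1, 6.5}, {2, 2.0}, {3, 5.1}   /* made-up CPI values */
    };
    int n = sizeof(threads) / sizeof(threads[0]);

    qsort(threads, n, sizeof(threads[0]), by_cpi);

    /* Core k gets the k-th lowest-CPI thread and the k-th highest-CPI thread. */
    for (int core = 0; core < n / 2; core++) {
        printf("core %d: tid %d (CPI %.1f) + tid %d (CPI %.1f)\n",
               core,
               threads[core].tid, threads[core].cpi,
               threads[n - 1 - core].tid, threads[n - 1 - core].cpi);
    }
    return 0;
}
```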
What is the test_and_set implementation?
while(test_and_set(lock) == BUSY); (+) good latency and low delay, but (-) lots of contention
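A minimal sketch of that lock using C11 atomics (the flashcard's line is pseudocode; this just fills in the shape, with "BUSY" corresponding to the flag being set):

```c
#include <stdatomic.h>

/* A minimal test_and_set spin lock using C11 atomics.
 * atomic_flag_test_and_set() atomically sets the flag and returns its old
 * value, so the loop keeps spinning while someone else already holds it. */
typedef struct {
    atomic_flag busy;
} tas_lock_t;

static void tas_lock_init(tas_lock_t *l) { atomic_flag_clear(&l->busy); }

static void tas_lock(tas_lock_t *l) {
    /* Every failed attempt is still an atomic read-modify-write, which is
     * exactly what generates the heavy bus/cache-coherence contention. */
    while (atomic_flag_test_and_set(&l->busy))
        ; /* spin */
}

static void tas_unlock(tas_lock_t *l) { atomic_flag_clear(&l->busy); }
```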
What is the test_and_test_and_set implementation?
while((lock == BUSY) || (test_and_set(lock) == BUSY));
This is better because we spin on a cached copy of the lock instead of going to main memory on every check == lower contention
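A sketch of the same idea with C11 atomics: spin on a plain read of the lock word (served from the local cache) and only fall back to the atomic exchange once the lock looks free.

```c
#include <stdatomic.h>

#define FREE 0
#define BUSY 1

/* test_and_test_and_set ("spin on read"). */
typedef struct {
    atomic_int state;
} ttas_lock_t;

static void ttas_lock(ttas_lock_t *l) {
    for (;;) {
        /* Cheap spinning: plain reads hit the cached copy, no bus traffic
         * while the lock stays BUSY. */
        while (atomic_load_explicit(&l->state, memory_order_relaxed) == BUSY)
            ; /* spin */
        /* Lock looked free -- now pay for one atomic test_and_set. */
        if (atomic_exchange(&l->state, BUSY) == FREE)
            return;
    }
}

static void ttas_unlock(ttas_lock_t *l) {
    atomic_store(&l->state, FREE);
}
```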
What is the spin lock delay implementation?
An improvement on test_and_test_and_set: add a delay inside the spinning loop. This decreases contention, but adds delay. The delay can be static (a fixed value, e.g. based on processor ID) or dynamic (e.g. backoff that grows with contention)
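A sketch of the dynamic-delay (backoff) variant, again with C11 atomics; the backoff cap and the crude busy-wait loop are arbitrary placeholders.

```c
#include <stdatomic.h>

#define FREE 0
#define BUSY 1

/* Spin lock with delay: same spin-on-read loop as test_and_test_and_set,
 * but back off before re-trying the atomic operation, so fewer processors
 * pile onto the bus at the same moment. */
typedef struct {
    atomic_int state;
} delay_lock_t;

static void delay_lock(delay_lock_t *l) {
    unsigned delay = 1; /* dynamic delay: grows after each failed attempt;
                           a static variant would use a fixed constant */
    while (atomic_exchange(&l->state, BUSY) == BUSY) {
        /* Spin on the cached copy until the lock looks free... */
        while (atomic_load_explicit(&l->state, memory_order_relaxed) == BUSY)
            ; /* spin */
        /* ...then wait a bit before contending again: less contention,
         * at the cost of extra delay. */
        for (volatile unsigned i = 0; i < delay; i++)
            ;
        if (delay < 1024)
            delay *= 2;
    }
}

static void delay_unlock(delay_lock_t *l) {
    atomic_store(&l->state, FREE);
}
```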
What is a queuing lock?
waiting threads are placed in a queue, and each waiter spins on its own flag. Requires more memory and has higher latency (a more expensive read-and-increment operation to get a queue slot), but has low delay and the best contention
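A sketch in the spirit of Anderson's array-based queueing lock. NPROC (the maximum number of concurrent contenders) is an assumption, and a real implementation would also pad each flag to its own cache line.

```c
#include <stdatomic.h>

#define NPROC 64   /* assumed upper bound on contending processors */

typedef struct {
    atomic_int has_lock[NPROC]; /* 1 = this slot may enter, 0 = must wait */
    atomic_uint next_slot;      /* next queue position to hand out */
} queue_lock_t;

static void queue_lock_init(queue_lock_t *l) {
    for (int i = 0; i < NPROC; i++)
        atomic_store(&l->has_lock[i], 0);
    atomic_store(&l->has_lock[0], 1);  /* first arrival gets the lock */
    atomic_store(&l->next_slot, 0);
}

/* Returns the slot the caller must remember and pass to queue_unlock(). */
static unsigned queue_lock(queue_lock_t *l) {
    /* fetch-and-increment: this extra atomic is why the queueing lock's
     * latency is worse than a plain test_and_set. */
    unsigned my_slot = atomic_fetch_add(&l->next_slot, 1) % NPROC;
    /* Each waiter spins on its own private flag, so a release only
     * disturbs one processor's cache (best contention, low delay). */
    while (atomic_load(&l->has_lock[my_slot]) == 0)
        ; /* spin */
    return my_slot;
}

static void queue_unlock(queue_lock_t *l, unsigned my_slot) {
    atomic_store(&l->has_lock[my_slot], 0);                /* give up slot */
    atomic_store(&l->has_lock[(my_slot + 1) % NPROC], 1);  /* signal next */
}
```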
What are the considerations for spin lock performance according to Anderson?
- Want to minimize the communication from spinning processors on the memory buses
- Want to minimize delay between releasing a lock and a new thread acquiring it
- Want to minimize the time it takes to acquire a lock when it is free (the mechanisms that accomplish the first two goals shouldn't add overhead to the uncontended case)
What is quiescence?
time it takes for all of the waiting processors to each try their test_and_set when a lock is released
What is the main problem with test_and_set according to Anderson?
- Lock is released -> invalidates caches
- Each processor tries a test_and_set at the same time -> creates a queue of test_and_set operations that are waiting to happen
- Each test_and_set invalidates every cache
- Processors cannot unqueue their test_and_set operation (the queued operations still execute, and still invalidate caches, even after the lock has already been grabbed)
- This is worse when the critical section is short
What are the hardware improvements that Anderson mentions for test_and_set?
- Provide parallel access to memory locations -> reduces test_and_set overhead
- Only invalidate cache if test_and_set changes the value