Chapter 1 and 2 HW Flashcards
The L1 cache is smaller than the L2 cache, and if there is an L3, the L2 is smaller than the L3. Give a practical and a theoretical reason why this is so.
One of the reasons that caches improve performance is temporal locality.
If the data in the L1 cache is loaded several times from the L2 cache because of temporal locality, then the L2 needs to be big enough to hold the contents of the L1 several times over. Otherwise, data evicted from L1 would also have been evicted from L2, so an L1 miss would produce a capacity miss in L2 as well, and the performance gain would not be realized. Therefore, to fully exploit temporal locality, each subsequent level should be larger than the previous one; that is the theoretical reason. The practical reason is cost and speed: the faster a cache level has to be, the more expensive it is per byte and the harder it is to make large, so the fastest level is kept small.
Sketch a scenario and give some code to argue that LRU is preferable over FIFO as a replacement strategy.
Consider iterating through an array to accumulate a sum. The accumulator is touched on every iteration, so LRU recognizes that it is being reused and retains its cache line; FIFO keeps evicting that line simply because it eventually becomes the oldest, regardless of how recently it was used. LRU is better at managing temporal locality, while FIFO handles only streaming (spatial-locality) access patterns well; the sketch below illustrates the scenario.
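A minimal C sketch of the scenario. In practice the accumulator would live in a register; for the sake of the argument, assume it is memory-resident (e.g., spilled or part of a larger data structure) so that its cache line competes with the streaming array data.

    /* Summing a large array: the accumulator is touched on every
       iteration (temporal locality), while each a[i] is touched exactly
       once (streaming access).  Under LRU, the accumulator's line is
       always the most recently used, so it is never the eviction victim.
       Under FIFO, that line eventually becomes the oldest resident line
       and is evicted despite constant reuse, forcing a refetch on every
       subsequent iteration. */
    double sum_array(const double *a, long n) {
        double sum = 0.0;             /* reused every iteration            */
        for (long i = 0; i < n; i++)
            sum += a[i];              /* each element used once, then dead */
        return sum;
    }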
Mapping from memory to cache was done by using the final 16 bits of a 32-bit memory address as the cache address. Show that the problems in this example go away if the mapping is done by using the first 16 bits as the cache address. Why is this not a good solution in general?
This solution is not good in general because of spatial locality. Consider iterating through an array: with the first (most significant) 16 bits as the cache address, consecutive memory locations all map to the same cache location, so each subsequent access causes a conflict miss. If we use the least significant bits instead, each subsequent access maps to a distinct cache address. The sketch below makes the two mappings concrete.
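A small C sketch of the two mappings (the specific addresses are illustrative only):

    #include <stdint.h>
    #include <stdio.h>

    /* Direct-mapped cache addressed by 16 of the 32 address bits. */
    uint16_t index_low (uint32_t addr) { return addr & 0xFFFFu; } /* last 16 bits  */
    uint16_t index_high(uint32_t addr) { return addr >> 16; }     /* first 16 bits */

    int main(void) {
        /* Four consecutive array accesses: the low-bit indices are all
           distinct, while the high-bit index is identical for every one,
           so the high-bit mapping makes them fight over one location. */
        for (uint32_t a = 0x10000u; a < 0x10004u; a++)
            printf("addr %#x: low index %#x, high index %#x\n",
                   a, (unsigned)index_low(a), (unsigned)index_high(a));
        return 0;
    }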
Show that with a prime number of banks, any stride up to that number will be conflict-free.
Why do you think this solution is not adopted in actual memory architectures?
A memory bank conflict will occur when two memory addresses have the same remainder when divided by the number of memory banks.
Therefore a conflict will occur if a + k·s ≡ a (mod b), that is, if k·s ≡ 0 (mod b), where
a is the address,
s is the stride,
k is the number of strided accesses,
b is the number of banks.
If b is a prime number, k·s can only be a multiple of b when either k or s is itself a multiple of b. Thus there will be no bank conflict for any stride smaller than b.
This is not adopted because the hardware that converts addresses to bank selectors would have to divide by a prime number, which is expensive to build. Normally the number of banks is a power of two, so the modulo operation can be accomplished simply by masking off the low-order bits, as in the sketch below.
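A minimal sketch of the two bank selectors (the bank counts 8 and 7 are assumptions for illustration):

    #include <stdint.h>

    #define NBANKS 8                        /* power of two               */

    /* Cheap: masking works only because NBANKS is a power of two. */
    static inline unsigned bank_pow2(uint32_t addr) {
        return addr & (NBANKS - 1);         /* same as addr % NBANKS      */
    }

    /* With a prime bank count, the selector needs a genuine modulo,
       i.e. division by a constant prime, in hardware. */
    static inline unsigned bank_prime(uint32_t addr) {
        return addr % 7;                    /* 7 banks, assumed prime     */
    }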
Analyze recursive doubling for bank conflicts.
In the first step, the data is read and written consecutively, so there are no conflicts. In the second step, memory is accessed with a stride of 2, so only half of the banks are used and each carries twice the traffic, causing conflicts. Assuming the number of banks is a power of two, the next step (stride 4) has 3 conflicts for every concurrent memory access, and so on. Overall, the number of bank conflicts is on the order of n. The sketch below shows where the strides come from.
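A minimal sketch of a recursive-doubling sum (n a power of two is assumed); at step s the active elements are multiples of s, so the stride doubles each step and the set of banks in use halves:

    /* After the loop, x[0] holds the sum of all n elements.  Step s
       touches x[0], x[s], x[2s], ...: stride 1, then 2, then 4, ...,
       so with 2^b banks, step s = 2^j uses only 2^(b-j) distinct banks
       (for j < b), concentrating the traffic. */
    void recursive_doubling_sum(double *x, long n) {  /* n a power of 2 */
        for (long s = 1; s < n; s *= 2)
            for (long i = 0; i + s < n; i += 2 * s)
                x[i] += x[i + s];
    }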
Analyze bank conflicts for recursive halving.
The memory is read and written in consecutive order, so the inner loop accesses independent banks, making the algorithm conflict-free; see the sketch below.
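A matching sketch of a recursive-halving sum (again assuming n is a power of two); the inner loop sweeps consecutive addresses, so concurrent accesses fall in distinct banks:

    /* After the loop, x[0] holds the sum.  Both the reads x[i] and
       x[i+s] and the writes x[i] run over consecutive addresses, so
       the accesses of the inner loop spread evenly over all banks. */
    void recursive_halving_sum(double *x, long n) {   /* n a power of 2 */
        for (long s = n / 2; s >= 1; s /= 2)
            for (long i = 0; i < s; i++)
                x[i] += x[i + s];
    }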
How would you determine whether a given program kernel is bandwidth bound?
We can count the number of words of memory that the algorithm needs to access to complete its operation. Given that volume of data and the memory bandwidth, we can estimate how long the memory needs to deliver it. If the algorithm's runtime is close to that estimate, its speed is constrained by the memory bandwidth and the computation is bandwidth bound; the performance can't be improved unless the algorithm makes fewer memory references. A numerical sketch follows.
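A minimal numerical sketch (the bandwidth figure and problem size are assumptions for illustration):

    #include <stdio.h>

    int main(void) {
        double bandwidth = 100e9;               /* bytes/s, assumed       */
        long   n         = 1L << 28;            /* doubles read, assumed  */
        double bytes     = (double)n * sizeof(double);
        double t_mem     = bytes / bandwidth;   /* lower bound on runtime */
        printf("memory-limited time: %.4f s\n", t_mem);
        /* A measured runtime close to t_mem means the kernel is
           bandwidth bound. */
        return 0;
    }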
How would you determine whether a given program kernel is compute bound?
We can count the number of arithmetic operations that the algorithm must perform to complete its task. That number divided by the processor's peak operation rate is the minimum runtime, i.e. the maximum speed of the algorithm. If the algorithm runs at this speed, it is compute bound, and the performance can't be improved without using fewer operations. A matching numerical sketch follows.
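The same estimate on the compute side (peak rate and matrix size are assumptions; the 2N^3 operation count is the one used for matrix-matrix multiplication later in this chapter):

    #include <stdio.h>

    int main(void) {
        double peak  = 50e9;                    /* flops/s, assumed       */
        long   N     = 2000;                    /* matrix size, assumed   */
        double flops = 2.0 * N * N * N;         /* 2N^3 operations        */
        double t_cpu = flops / peak;            /* lower bound on runtime */
        printf("compute-limited time: %.4f s\n", t_cpu);
        /* A measured runtime close to t_cpu means the kernel is compute
           bound; comparing t_cpu against the bandwidth estimate above
           tells you which resource is the constraint. */
        return 0;
    }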
Consider placing the nodes within a superstep on random processors. Show that, if no two nodes wind up on the same processor, at most twice the number of communications is performed.
Every add operation has two inputs and one output; assume each operation runs on the processor that owns its output's memory. In the standard partitioning, an operation shares a processor with one of its inputs, so exactly one edge crosses a processor boundary per operation at each step of the algorithm. With random placement, both input edges of an operation can cross processor boundaries, so in the worst case the number of communicated edges doubles.
Consider the case of summing 8 elements with 4 processors. Show that some of the edges in the graph no longer correspond to actual communications. Now consider summing 16 elements with 4 processors. What is the number of communication edges this time?
Only the last log2(p) steps involve communication, independent of the number of elements; all earlier combining happens within a processor, so those graph edges do not correspond to actual communications. Hence the number of communication edges is the same in both cases: with 4 processors there are just two steps with communication.
Consider for instance the matrix-matrix multiplication C = AB, which takes 2N^3 operations where N is the matrix size. Since there are no dependencies between the operations for the elements of C, we can perform them all in parallel. If we had N^2 processors, we could assign each to an (i,j) coordinate in C and have it compute C(i,j) in 2N time. Thus this parallel operation has efficiency 1, which is optimal. Show that this algorithm ignores some serious issues about memory usage:
Each row of matrix A is needed by the N processors computing that row of C, and each column of B by the N processors computing that column. Thus every element of A and B is read concurrently by N processors.
If, on the other hand, copies of the rows of A and the columns of B were broadcast to the appropriate processors, then every parallel task would need O(N) storage. Since there are O(N^2) processors, the overall memory requirement would be O(N^3), far more than the O(N^2) needed to store the input and output matrices themselves. The sketch below makes the per-task requirement explicit.
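A sketch of the hypothetical per-processor task under this scheme (the function name and row-major layout are assumptions); it makes the replication visible:

    /* Task for the processor assigned to coordinate (i,j): it reads the
       full row i of A and the full column j of B, i.e. O(N) data per
       task.  Row i is shared by the N tasks (i,0)..(i,N-1) and column j
       by the N tasks (0,j)..(N-1,j), so each input element is read N
       times across the machine. */
    double compute_cij(const double *A, const double *B, long N,
                       long i, long j) {
        double cij = 0.0;
        for (long k = 0; k < N; k++)
            cij += A[i * N + k] * B[k * N + j];   /* 2N operations */
        return cij;
    }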
Give the number of wires and the number of switching elements that are needed in both cases to connect n processors and memories. What is the time that a data packet needs to go from memory to processor, expressed in the unit time that it takes to traverse a unit length of wire and the time it takes to traverse a switching element? (Crossbar Switch)
For the crossbar switch there are n^2 switching elements, arranged in a grid with wires running along each of the two dimensions, so there are 2n^2 unit lengths of wire in total. A packet traverses on the order of n unit lengths of wire but only a single switching element, so the wire delay is n and the switch delay is 1.
Give the number of wires and the number of switching elements that are needed in both cases to connect n processors and memories. What is the time that a data packet needs to go from memory to processor, expressed in the unit time that it takes to traverse a unit length of wire and the time it takes to traverse a switching element? (Butterfly Switch)
For the butterfly exchange network there are log2(p) levels of switches, each level having p/2 switches, so the network contains p·log2(p)/2 switches overall. Since each switch has 2 wires attached, there are p·log2(p) wires in the network. Both the wire delay and the switch delay are proportional to log2(p).
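A tiny C sketch that just evaluates the counts above side by side (the network size is an assumption, and n is used for both networks):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double n  = 64;                         /* network size, assumed */
        double lg = log2(n);
        printf("crossbar : %g switches, %g wires, delay ~ n wire + 1 switch\n",
               n * n, 2 * n * n);
        printf("butterfly: %g switches, %g wires, delay ~ log2(n) in both\n",
               n * lg / 2, n * lg);
        return 0;
    }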