Single Processor Computing Flashcards
Von Neumann Architecture
A single stream of instructions executed in order.
Control Flow
A prescribed sequence of instructions
Stored Program
Instructions and program data stored in the same memory
Fetch, Execute, Store
Fetch: the next instruction is loaded onto the processor.
Execute: the operation is executed.
Store: the result is written back to memory.
Direct To Memory Architecture
Allows data to be sent directly between memory and the processor.
The work stays in memory.
Out-of-order instruction handling
Instructions can be processed in a different order from the one in which they appear in the program.
Why do modern processors use out-of-order instruction handling?
It increases performance: independent instructions can be reordered to keep the pipeline full instead of stalling on dependencies.
Pipelining
Splitting instruction execution into smaller stages so the CPU can work on multiple instructions at the same time, each at a different stage, like an assembly line.
Execution time for pipelined instructions?
t(n) = (n + n_{1/2}) · x
x = time for one basic operation (one pipeline stage, i.e., one clock cycle)
n = number of operations
n_{1/2} = pipeline setup overhead expressed as an equivalent number of operations (the n at which half the asymptotic speed is reached)
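Worked example (hypothetical numbers): with setup overhead n_{1/2} = 4 and x = 1 ns per stage, processing n = 1000 operations takes t(1000) = (1000 + 4) × 1 ns = 1004 ns, so once the pipeline is full it delivers nearly one result per cycle.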
Instruction level parallelism (ILP)
Finding independent instructions and running them in parallel.
Speculative Execution
Executing instructions before it is known whether they are needed, assuming the test they depend on will turn out a particular way; the results are discarded if the guess was wrong.
Prefetching
Data can be speculatively requested before any instruction needing it is actually encountered.
Branch Prediction
Guessing whether a conditional instruction will evaluate to true and executing accordingly.
von Neumann bottleneck
Transferring data from memory takes more time than actually processing the data.
How does the memory hierarchy address the von Neumann bottleneck?
By keeping frequently used data in small, fast memories close to the processor, reducing the number of times the CPU has to wait for data from slow main memory.
Latency
The delay between the processor issuing a request for a memory item and the item arriving.
Bandwidth
The rate at which data arrives at its destination after the initial latency is overcome.
What role do registers play in the memory hierarchy?
Registers are the fastest and allow the CPU to access data instantly without going to slower caches or memory.
What role do cache systems play in the memory hierarchy?
Cache stores frequently accessed data that the CPU might need without having the CPU go to the main memory.
Cache Tags
Info keeping track of the location and status of data in the cache.
What are the 3 types of cache misses in single-processor systems?
Compulsory: Caused when data is first accessed and is not present in the cache.
Capacity: Caused by data having been overwritten because the cache cannot contain all the problem data.
Conflict: Caused by one data item being mapped to the same cache location as another.
What additional type of cache miss occurs in multi-core systems?
Coherence: multiple processor cores have copies of the same data in their cache, but those copies become inconsistent due to changes.
What two properties of typical program memory access patterns are exploited by cache systems?
Temporal Locality: When a memory location is accessed, it is more likely to be accessed again.
Spatial locality: When a memory location is accessed, it is likely that nearby memory locations will also be accessed.
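A minimal C sketch of both properties (assuming C's row-major array layout): the running total exploits temporal locality, and the inner loop exploits spatial locality.

    #include <stdio.h>

    #define N 1024

    static double a[N][N];

    int main(void) {
        double sum = 0.0;
        /* Inner loop over j touches consecutive addresses: spatial locality.
           'sum' is reused on every iteration: temporal locality.
           Swapping the loops (j outer, i inner) would jump N*8 bytes per
           access and waste most of each cache line. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];
        printf("%f\n", sum);
        return 0;
    }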
LRU Replacement Policy
Least Recently Used
FIFO Replacement Policy
First In First Out
Cache Line
The smallest unit of data transferred between memory and the cache.
What property of program memory accesses do cache lines exploit?
Spatial locality.
What effect does stride have on cache line efficiency?
If the cache line is large, a large stride means only a small part of each line is actually used, wasting the rest of the transfer (see the sketch below).
If the cache line were very small, striding wouldn't necessarily cause a problem.
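A minimal C sketch (the stride and the 64-byte line size are assumed values):

    #include <stdio.h>

    #define N (1 << 20)

    static double a[N];

    int main(void) {
        double sum = 0.0;
        int stride = 8; /* 8 doubles = 64 bytes: one assumed cache line */
        /* Each iteration lands on a new cache line but uses only 1 of the
           8 doubles that line contains, so roughly 7/8 of the memory
           traffic is wasted compared to stride-1 access. */
        for (int i = 0; i < N; i += stride)
            sum += a[i];
        printf("%f\n", sum);
        return 0;
    }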
Cache Mapping
The policy deciding where in the cache a block of memory is placed.
Direct Mapped Cache
Each memory block is mapped to only one specific cache location.
Fully associative cache
Each memory block can be mapped to any cache location.
K-way set associative cache
The cache is divided into a number of sets, each containing k cache lines.
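Worked example (hypothetical numbers): a 32 KB cache with 64-byte lines has 32768 / 64 = 512 lines. Direct mapped: a block at address A can go only in line (A / 64) mod 512. Fully associative: it can go in any of the 512 lines. 4-way set associative: the lines form 128 sets, and the block can go in any of the 4 lines of set (A / 64) mod 128.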
Cache Memory
Small, fast memory used to store frequently accessed data
Dynamic Memory
Main memory (DRAM): holds the data of the currently running program; larger but slower than cache, and its contents must be periodically refreshed.
Stall
Delay in the execution of instructions caused by a dependency between instructions.
Cache Miss
When the data is not available in the cache and has to be fetched from main memory.
Prefetch data stream
When the processor predicts which data will be accessed next and fetches it into the cache before it is needed.
What is a hardware prefetch and how is it different from a prefetch intrinsic?
Intrinsic = the software explicitly inserts prefetch instructions (see the sketch below).
Hardware = the processor's prefetch unit detects regular access patterns (e.g., fixed strides) and issues prefetches automatically.
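A minimal sketch of a software prefetch using the GCC/Clang intrinsic __builtin_prefetch; the prefetch distance of 64 iterations is an assumed tuning value, and a hardware prefetcher would typically detect this unit-stride pattern on its own.

    #include <stdio.h>

    #define N (1 << 20)

    static double a[N];

    int main(void) {
        double sum = 0.0;
        for (int i = 0; i < N; i++) {
            /* Request data 64 iterations ahead so it is already in cache
               when the loop reaches it (args: address, 0 = read,
               1 = low temporal locality). */
            if (i + 64 < N)
                __builtin_prefetch(&a[i + 64], 0, 1);
            sum += a[i];
        }
        printf("%f\n", sum);
        return 0;
    }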
What is Little’s Law?
What does it tell us about the effect of latency on computing systems?
Concurrency = Bandwidth × Latency.
To sustain a given bandwidth, enough requests must be kept in flight to cover the latency; if latency grows and the hardware cannot keep more requests outstanding, achieved bandwidth (and thus performance) drops.
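Worked example (hypothetical numbers): with 100 ns memory latency and 10 GB/s bandwidth, 10 × 10^9 B/s × 100 × 10^-9 s = 1000 bytes must be in flight at all times to keep the memory system fully busy.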
Why are memory banks used in memory systems?
To increase memory bandwidth.
If multiple requests target the same bank (bank conflict), the requests must be handled in serial, which reduces memory bandwidth.
What is a bank conflict?
When two or more memory operations need the same memory bank at the same time (or before the bank has recovered from the previous access), forcing them to be handled serially.
What is the TLB?
What function does it have in the memory hierarchy?
Translation Look-aside Buffer: a cache of frequently used page table entries, providing fast address translation of pages.
When a program references a memory location, the processor first checks the TLB for the address translation; on a miss it must look up the page in the page table in main memory, which is much slower.
What property would a program have that would cause performance degradation due to TLB misses?
Poor spatial locality: accesses jump between more pages than the TLB has entries for, so translations keep missing.
How big is a typical TLB?
Between 64 and 512 entries.
What are cache-aware and cache-oblivious algorithms?
Aware: Designed to take advantage of specific cache sizes.
Oblivious: Designed to perform well across a wide range of cache sizes without requiring any knowledge of cache parameters.
What is loop tiling and what is it used for?
Breaking a loop's iteration space into smaller blocks (tiles) and processing one block at a time, so each block's data stays in cache and is reused before being evicted (see the sketch below).
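A minimal C sketch of a tiled matrix transpose; the tile size B = 32 is an assumed value chosen so that a pair of tiles fits in cache, and N is a multiple of B.

    #include <stdio.h>

    #define N 1024
    #define B 32 /* tile size: assumed small enough for two BxB tiles to fit in cache */

    static double a[N][N], t[N][N];

    int main(void) {
        /* Each BxB tile of 'a' is read, and the matching tile of 't'
           written, while both are still cache-resident, instead of
           striding through whole rows and columns. */
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++)
                        t[j][i] = a[i][j];
        printf("%f\n", t[0][0]);
        return 0;
    }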
What is the motivation for multi-core architectures?
Single-core clock speeds cannot be raised much further (power and heat limits), so instead separate cores can work on unrelated tasks, or cooperate on one task through data parallelism.
Define Core, Socket, and Node
Core: an independent processing unit that executes its own instruction stream.
Socket: the physical package (chip) that plugs into the motherboard, containing one or more cores.
Node: contains one or more sockets accessing the same shared memory.
Cache Coherence
Ensuring that all cached copies of a data item stay consistent with each other and with main memory.
False Sharing
When multiple threads access different variables that happen to lie in the same cache line: every write invalidates the line in the other cores' caches, causing coherence traffic even though no data is actually shared (see the sketch below).
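A minimal OpenMP/C sketch of the usual fix, padding each thread's counter onto its own cache line (compile with -fopenmp; the 64-byte line size and the thread count are assumed values):

    #include <stdio.h>
    #include <omp.h>

    #define NTHREADS 4
    #define PAD 8 /* 8 doubles = 64 bytes: one assumed cache line */

    /* With plain counts[NTHREADS], all counters would share one cache
       line and every increment would invalidate the other cores' copies;
       padding gives each counter its own line. */
    static double counts[NTHREADS][PAD];

    int main(void) {
        #pragma omp parallel num_threads(NTHREADS)
        {
            int t = omp_get_thread_num();
            for (int i = 0; i < 1000000; i++)
                counts[t][0] += 1.0;
        }
        printf("%f\n", counts[0][0]);
        return 0;
    }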
NUMA
Non-uniform memory access:
For a processor running on some core, the memory attached to its socket is faster to access than the memory attached to another socket.
First touch phenomenon
Dynamically allocated memory is not mapped to physical pages until it is first written to; each page is then placed in the memory attached to the socket of the thread that first touches it (see the sketch below).
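A minimal OpenMP/C sketch exploiting first touch (compile with -fopenmp; assumes a first-touch placement policy as on Linux, and the array size is an arbitrary assumed value):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        int n = 1 << 24;
        double *x = malloc(n * sizeof(double));
        /* Initialize in parallel with the same static schedule the compute
           loop will use: each page is first touched, and therefore placed,
           on the NUMA node of the thread that will later work on it. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++)
            x[i] = 0.0;
        /* ... compute loops using the same schedule(static) ... */
        printf("%f\n", x[0]);
        free(x);
        return 0;
    }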
What is arithmetic intensity and how do we compute it?
Number of operations per memory access.
f(n) / n where f(n) is the number of operations it takes, and n is the number of data items that an algorithm operates on.
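Worked example: the update x[i] = x[i] + y[i] over n elements does n additions against roughly 3n memory accesses (two loads and a store per element), an intensity of 1/3; dense matrix-matrix multiplication does O(n^3) operations on O(n^2) data, an intensity of O(n).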
What is the roofline model?
What defines the roofline?
A plot of attainable performance (operations per second) against arithmetic intensity.
The roofline is formed by two performance limits: a slanted part set by peak memory bandwidth (bandwidth × intensity) and a flat part set by peak arithmetic throughput.
How can we use the roofline model to optimize code performance?
Locate the code's arithmetic intensity on the plot to see whether it is limited by memory bandwidth or by arithmetic throughput, which tells us whether to optimize data movement (e.g., raise intensity through reuse) or the computation itself.
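Worked example (hypothetical machine): with a peak of 100 GFLOP/s and 25 GB/s memory bandwidth, the ridge point sits at 100 / 25 = 4 FLOP/byte; a kernel with intensity 1/3 is bandwidth-bound at about 25 × 1/3 ≈ 8.3 GFLOP/s, while a kernel with intensity 8 can reach the 100 GFLOP/s flat roof.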