Wk 9: SIMD Flashcards
Why do I care about SIMD?
It’s becoming the new standard. There’s only so much more speed we can squeeze out of superscalar execution, and we’ve pretty much hit that limit.
What’s the biggest challenge that arises with SIMD?
Memory. SIMD needs lots and lots of data streaming to and from memory to keep the parallel lanes fed.
Where’s SIMD best?
Dot products, matrix multiply, dealing with arrays, especially big arrays.
How does Amdahl’s law come into play with SIMD?
I only see improvement in proportion to the fraction of the program that I speed up. Not every program, and not every part of a program, can be sped up via SIMD.
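A quick worked version of the Amdahl card, as a sketch. The function name amdahl_speedup and the example numbers are my own, not from the lecture: if a fraction f of the run time is sped up by a factor s (s = 4 for 4-wide single precision SIMD), the overall speedup is 1 / ((1 - f) + f / s).

```c
/* Amdahl's law: overall speedup when a fraction f (0..1) of the
   run time is made s times faster and the rest is unchanged.
   The name and signature are illustrative, not from the course. */
double amdahl_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}
```

For example, if only half the program vectorizes (f = 0.5) with 4-wide SIMD (s = 4), the whole program gets about 1.6x faster, not 4x.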
How much can a SIMD register hold? How many of them are there?
4 single precision floats (32 bits, 4 bytes each) or 2 double precision floats (64 bits, 8 bytes each), so 128 bits total. There are 16 SIMD registers (xmm0 to xmm15 on x86-64).
How do you tell the SIMD assembly instruction names apart?
addss (scalar single precision: only the low element is added)
addps (packed single precision: all 4 elements are added)
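To see the scalar vs. packed difference from C, here is a minimal sketch using the SSE intrinsics that correspond to addss and addps (_mm_add_ss and _mm_add_ps). The wrapper names add_scalar and add_packed are made up for this example, and the unaligned load/store is used only so the sketch works on any array.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Scalar add (addss): only the lowest float is added;
   the upper three elements are copied from a. */
void add_scalar(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ss(va, vb));
}

/* Packed add (addps): all four floats are added at once. */
void add_packed(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, add_scalar gives {11, 2, 3, 4} and add_packed gives {11, 22, 33, 44}.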
What are the four options available to take advantage of SIMD?
Write directly in assembly
Use C libraries created for it
Use C intrinsics
Compiler vectorization options
Why is memory alignment so important in SIMD? On what value should it be aligned?
Want to get all of the data in one read from memory and store it in one write. If we start adding cycles to load/store, we lose the gains we were looking for by doing everything in parallel. Plus the assembly code for moving aligned variables is so much faster.
Align on 128 bits (16 bytes)
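A sketch of what 16-byte alignment looks like with intrinsics. The helper names (alloc_floats16, is_aligned16, scale_aligned) are hypothetical; _mm_malloc, _mm_free, _mm_load_ps and _mm_store_ps are the real intrinsics. The aligned load/store versions require the address to be a multiple of 16, and this sketch assumes n is a multiple of 4.

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics, also provides _mm_malloc/_mm_free */

/* Allocate n floats on a 16-byte boundary; release with _mm_free. */
float *alloc_floats16(size_t n)
{
    return (float *)_mm_malloc(n * sizeof(float), 16);
}

/* Returns 1 if p sits on a 16-byte (128-bit) boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}

/* Multiply n floats by k in place, 4 at a time.
   Assumes x is 16-byte aligned and n is a multiple of 4. */
void scale_aligned(float *x, size_t n, float k)
{
    __m128 vk = _mm_set1_ps(k);              /* k in all 4 elements */
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(x + i);       /* aligned load */
        _mm_store_ps(x + i, _mm_mul_ps(v, vk)); /* aligned store */
    }
}
```

If the pointer can't be guaranteed aligned, _mm_loadu_ps / _mm_storeu_ps are the unaligned counterparts.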
What should the memory bus size be to make SIMD worth our while? Why?
128 bits. If it takes 4 cycles to fill one 128-bit register from memory (as with a 32-bit bus), that doesn’t help us.
What does using intrinsics buy us?
Power of assembly-like commands without getting bogged down in the details of individual registers, etc.
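As an illustration of what intrinsics look like in practice, here is a minimal dot product sketch. The function name dot_sse is my own, and it assumes n is a multiple of 4. Note there is no mention of specific registers: the compiler picks them.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Dot product of two float arrays, 4 elements per step.
   Assumes n is a multiple of 4 (no cleanup loop in this sketch). */
float dot_sse(const float *a, const float *b, size_t n)
{
    __m128 acc = _mm_setzero_ps();           /* 4 running partial sums */
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);     /* explicit load */
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);               /* one explicit store at the end */
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}
```

The accumulator stays in a register for the whole loop and is stored only once, which keeps the load/store count down.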
How are MSB / LSB represented?
Backwards
LSB -> MSB (the leftmost element is the low one, and that’s the part a scalar instruction affects)
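One place this ordering shows up with intrinsics (an illustration of my own, not necessarily the example from lecture): _mm_set_ps takes its arguments from the high element down to the low one, so the last argument ends up first in memory, and scalar operations such as _mm_cvtss_f32 read the low element. The function names below are hypothetical.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Build a register with _mm_set_ps(3, 2, 1, 0) and write it to memory.
   The arguments go high element -> low element, so memory order
   comes out reversed relative to how the call is written. */
void lane_order(float out[4])
{
    __m128 v = _mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f);
    _mm_storeu_ps(out, v);
}

/* Scalar access reads the low element (the LAST argument of _mm_set_ps). */
float low_lane(void)
{
    return _mm_cvtss_f32(_mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f));
}
```

After lane_order(out), out holds {0, 1, 2, 3}, and low_lane() returns 0.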
What are the 3 main issues of SIMD to footstomp?
- Must use memory alignment
- Must explicitly say when to load / store
- Must handle overhead of shuffles
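To make the shuffle overhead concrete, here is a sketch of adding up the 4 elements of one register (a horizontal sum). The name hsum4 is made up; _mm_shuffle_ps, _mm_movehl_ps, _mm_add_ps and _mm_add_ss are real SSE intrinsics. It takes two data-rearranging steps and two adds just to get one number out.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Sum the 4 floats at p using shuffles.
   Lane contents are written low -> high in the comments. */
float hsum4(const float *p)
{
    __m128 v    = _mm_loadu_ps(p);                               /* [v0, v1, v2, v3] */
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); /* [v1, v0, v3, v2] */
    __m128 sums = _mm_add_ps(v, shuf);       /* [v0+v1, v0+v1, v2+v3, v2+v3] */
    shuf = _mm_movehl_ps(shuf, sums);        /* low element is now v2+v3 */
    sums = _mm_add_ss(sums, shuf);           /* low element: v0+v1+v2+v3 */
    return _mm_cvtss_f32(sums);
}
```

This is why it usually pays to keep 4 partial sums through a whole loop and do the shuffle-heavy reduction only once at the end.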
Why is explicitly specifying loads/stores so important with SIMD?
If my code isn’t efficient with its memory accesses, I could be using SIMD and still end up doing more loads and stores than necessary, negating the benefits of SIMD in the first place.
How did fast math make a difference on the homework?
y = y + a[i]
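With fast math, the loop indexes by 4 (it gets vectorized). Without fast math, it still indexes by 1. This is because the compiler takes more liberty when fast math tells it it’s allowed to: summing 4 elements at a time changes the order of the floating-point additions, which can change the rounding, so the compiler won’t do it on its own. (Aliasing worries are a separate issue that fast math doesn’t address.)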
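A sketch of why the compiler needs permission, in plain C with no intrinsics. sum_by_1 is the loop as written; sum_by_4 is roughly what indexing by 4 does, keeping 4 partial sums and combining them at the end. Both function names are my own, the actual compiler output on the homework may differ in detail, and n is assumed to be a multiple of 4.

```c
#include <stddef.h>

/* The loop as written: strictly left-to-right additions. */
float sum_by_1(const float *a, size_t n)
{
    float y = 0.0f;
    for (size_t i = 0; i < n; i++)
        y = y + a[i];
    return y;
}

/* Roughly what vectorizing does: 4 independent partial sums,
   combined once at the end. Assumes n is a multiple of 4. */
float sum_by_4(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```

The two orders give the same answer for tame data, but not always. With a = {1e8, 1, 1, 1, -1e8, 1, 1, 1}, the by-1 loop loses the small values next to 1e8 while the by-4 version does not, so the results differ. That difference is exactly what the compiler is not allowed to introduce without fast math.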