Multicore Processors (2) Flashcards by Tabby Black

What is memory consistency?

The ordering in which one core sees the reads and writes made by another core that uses the same shared memory

Look at consistency example

Look at reordering memory ops example

Why might loads and stores to different addresses get reordered?

To allow higher performance from the core

What can reordering memory ops lead to?

Non-intuitive situations

What does coherence ensure?

That cached data written by a core is seen by others

What does the memory consistency model determine?

The order in which memory operations from one core can be seen by other cores

What is relaxed consistency? When do we want to use this?

Some loads and stores can bypass each other
Relative ordering between operations to the same address is maintained
Processors want relaxed consistency because the freedom to reorder gives higher performance

What type of consistency do we want for shared data?

Sequential (strong) consistency
All reads and writes by a single processor are seen by the other processors in the order they occur in its program
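
The difference between sequential and relaxed consistency can be sketched with a small simulation. This is only a toy model, not how any real processor works: the two-core program, the names `CORE0`, `CORE1`, `outcomes` and the rule "operations to different addresses may swap" are assumptions chosen for illustration. Each core stores 1 to one variable and then loads the other.

```python
from itertools import permutations

# Each operation is (kind, address, value_or_register). Hypothetical test program.
CORE0 = [("store", "X", 1), ("load", "Y", "r0")]
CORE1 = [("store", "Y", 1), ("load", "X", "r1")]


def interleavings(a, b):
    """Yield every merge of a and b that keeps the order inside each list."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def core_orders(program, relaxed):
    """Orders in which one core may perform its operations."""
    if not relaxed:
        return [list(program)]  # sequential: program order only
    orders = []
    for perm in permutations(program):
        # Relaxed: anything may swap, except operations to the same address.
        keeps_same_address_order = all(
            perm.index(a) < perm.index(b)
            for i, a in enumerate(program)
            for b in program[i + 1:]
            if a[1] == b[1]
        )
        if keeps_same_address_order:
            orders.append(list(perm))
    return orders


def outcomes(relaxed):
    """Set of all (r0, r1) results over every allowed execution."""
    results = set()
    for order0 in core_orders(CORE0, relaxed):
        for order1 in core_orders(CORE1, relaxed):
            for schedule in interleavings(order0, order1):
                memory = {"X": 0, "Y": 0}
                regs = {}
                for kind, addr, arg in schedule:
                    if kind == "store":
                        memory[addr] = arg
                    else:
                        regs[arg] = memory[addr]
                results.add((regs["r0"], regs["r1"]))
    return results
```

In this model `outcomes(False)` never contains `(0, 0)`, because under sequential consistency at least one store must come before both loads. `outcomes(True)` does contain `(0, 0)`, which is the kind of non-intuitive result that reordering allows.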

Describe a memory barrier

Guarantees the ordering of memory operations within a core - used to force sequential consistency where it is needed
All prior memory operations complete before the barrier finishes execution, i.e. a store after the barrier can't overtake a load before it
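
A rough way to see what a barrier buys is to model it as a point that operations cannot be moved across. The sketch below is a simplified model under that assumption; the producer/consumer program and the names `BARRIER`, `core_orders` and `outcomes` are made up for illustration. It also assumes every operation within a segment touches a different address, so any swap inside a segment is allowed.

```python
from itertools import permutations, product

BARRIER = ("barrier", None, None)  # hypothetical marker for a memory barrier


def interleavings(a, b):
    """Yield every merge of a and b that keeps the order inside each list."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def core_orders(program):
    """Reorder freely between barriers, but never move an op across one."""
    segments, current = [], []
    for op in program:
        if op == BARRIER:
            segments.append(current)
            current = []
        else:
            current.append(op)
    segments.append(current)
    choices = [[list(p) for p in permutations(seg)] for seg in segments]
    return [[op for seg in combo for op in seg] for combo in product(*choices)]


def outcomes(producer, consumer):
    """Set of all (r_flag, r_data) results over every allowed execution."""
    results = set()
    for p_order in core_orders(producer):
        for c_order in core_orders(consumer):
            for schedule in interleavings(p_order, c_order):
                memory = {"data": 0, "flag": 0}
                regs = {}
                for kind, addr, arg in schedule:
                    if kind == "store":
                        memory[addr] = arg
                    else:
                        regs[arg] = memory[addr]
                results.add((regs["r_flag"], regs["r_data"]))
    return results


# Producer writes data then raises a flag; consumer reads flag then data.
PRODUCER = [("store", "data", 42), ("store", "flag", 1)]
CONSUMER = [("load", "flag", "r_flag"), ("load", "data", "r_data")]
PRODUCER_BAR = [("store", "data", 42), BARRIER, ("store", "flag", 1)]
CONSUMER_BAR = [("load", "flag", "r_flag"), BARRIER, ("load", "data", "r_data")]
```

Without barriers, `outcomes(PRODUCER, CONSUMER)` includes `(1, 0)`: the consumer sees the flag set but reads stale data. With a barrier in each core, `outcomes(PRODUCER_BAR, CONSUMER_BAR)` no longer includes it.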

Look at example of memory barrier

What are atomic operations?

Uninterruptible sequences of operations that appear to all occur as one
Used to create software synchronisation primitives, e.g. locks and thread barriers

Describe Read-Modify-Write (RMW)

Most basic class of atomic operation
Provide ability to read a memory location and simultaneously write a new value back

Give 2 examples of RMW

Atomic exchange
Fetch and add

Draw a diagram of atomic exchange (between X in cache and R1 in core)

What is atomic exchange? Why is it difficult in RISC machines?

Read and write in one uninterruptible instruction
But this would require a complex instruction, and in RISC we prefer simple load or store instructions

What can we use instead of atomic exchange?

Load reserved / store conditional - instead of one instruction, provide two halves

Describe load reserved / store conditional

The two instructions are linked together: the store only succeeds if X hasn't changed since the load reserved; any write to X causes the store to fail
The source register of the store conditional contains 1 on success, 0 on failure
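
The behaviour described above can be modelled in a few lines. This is a software sketch only, assuming a reservation is tracked per core and cancelled by any write to the reserved address; the class `Memory` and its method names are hypothetical, not a real API.

```python
class Memory:
    """Toy shared memory with load reserved / store conditional."""

    def __init__(self):
        self.data = {}
        self.reserved = {}  # core id -> address it holds a reservation on

    def load_reserved(self, core, addr):
        self.reserved[core] = addr
        return self.data.get(addr, 0)

    def store(self, core, addr, value):
        """Ordinary store: any write to addr cancels every reservation on it."""
        self.data[addr] = value
        for c in [c for c, a in self.reserved.items() if a == addr]:
            del self.reserved[c]

    def store_conditional(self, core, addr, value):
        """Return 1 on success, 0 on failure (the convention used in this deck)."""
        if self.reserved.get(core) != addr:
            return 0
        self.store(core, addr, value)
        return 1


def demo():
    mem = Memory()
    mem.store(0, "X", 5)
    mem.load_reserved(0, "X")
    first = mem.store_conditional(0, "X", 6)   # nobody wrote X in between
    mem.load_reserved(0, "X")
    mem.store(1, "X", 9)                       # core 1 writes X in between
    second = mem.store_conditional(0, "X", 7)  # reservation was cancelled
    return first, second, mem.data["X"]
```

`demo()` returns `(1, 0, 9)`: the first store conditional succeeds, the second fails because core 1 wrote X, and X keeps core 1's value.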

Write the RISC instruction for atomic exchange between X in cache and R1 in core

atomic_exch X, r1

Write the RISC instructions for load reserved / store conditional between X in cache and R1 in core

load_reserved r0, X
store_conditional r1, X

Write out the instruction sequence for atomic exchange

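xchg: mov r3, a0 (r3 = new value to store)
lr r4, 0(r1) (r4 = old value)
sc r3, 0(r1) (r3 = 1 on success, 0 on failure)
beqz r3, xchg (store conditional failed, so branch back to top of loop)
mov a0, r4 (return the old value in a0)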

Write out the instruction sequence for fetch-and-add

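fadd: lr r4, 0(r1) (r4 = fetched value)
add r3, r4, 1 (r3 = value to store, so r4 keeps the fetched value)
sc r3, 0(r1) (r3 = 1 on success, 0 on failure)
beqz r3, fadd (store conditional failed, so retry)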
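
The retry loops in the exchange and fetch-and-add sequences can be sketched on top of a toy LR/SC memory. Everything here is an illustrative assumption: the `Memory` class, the `between` hook (which only exists so another core's write can be injected between the two halves) and the helper `write_once` are hypothetical.

```python
class Memory:
    """Toy shared memory with load reserved / store conditional."""

    def __init__(self):
        self.data = {}
        self.reserved = {}  # core id -> reserved address

    def load_reserved(self, core, addr):
        self.reserved[core] = addr
        return self.data.get(addr, 0)

    def store(self, core, addr, value):
        self.data[addr] = value
        for c in [c for c, a in self.reserved.items() if a == addr]:
            del self.reserved[c]  # any write cancels reservations on addr

    def store_conditional(self, core, addr, value):
        if self.reserved.get(core) != addr:
            return 0  # failure
        self.store(core, addr, value)
        return 1      # success


def atomic_exchange(mem, core, addr, new_value, between=None):
    """Swap new_value into addr and return the old value, retrying on failure."""
    while True:
        old = mem.load_reserved(core, addr)
        if between is not None:
            between()  # test hook: runs between the lr and the sc
        if mem.store_conditional(core, addr, new_value) == 1:
            return old


def fetch_and_add(mem, core, addr, amount=1, between=None):
    """Add amount to addr and return the value fetched, retrying on failure."""
    while True:
        old = mem.load_reserved(core, addr)
        if between is not None:
            between()
        if mem.store_conditional(core, addr, old + amount) == 1:
            return old


def write_once(mem, core, addr, value):
    """Build a hook that performs one interfering store, then does nothing."""
    pending = [True]

    def hook():
        if pending[0]:
            pending[0] = False
            mem.store(core, addr, value)

    return hook
```

For example, starting with X = 10 and a hook that makes core 1 write 20 once, `fetch_and_add` fails its first store conditional, retries, and returns 20, leaving 21 in memory.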

Write out the instruction sequence for a more optimised spin lock

lock: lr r4, 0(r1)
bneq r4, lock
mov r3, #1
sc r3, 0(r1)
beqz r3, lock

Write out the instruction sequence for a simple spin lock

lock: mov r3, #1 (take lock)
lr r4, 0(r1)
sc r3, 0(r1)
beqz r3, lock (check if store conditional was successful)
bneq r4, lock (if r4 contains 1, start again because the lock had already been taken)

Give the advantage and disadvantage of the code for a more optimised spin lock

+ More efficient: it checks whether the lock has been taken before attempting the atomic exchange, which prevents unnecessary writes
- There are instructions between lr and sc, so the store conditional is more likely to fail

Look over naive spin lock example

What is the disadvantage of the naive spin lock?

Causes lots of bus traffic - lots of cache coherence work due to attempted writes
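
The write traffic can be compared with a small counting experiment. This is a sketch under simplifying assumptions: the `Memory` class is a hypothetical toy model, and each store conditional attempt is counted as one attempted write, standing in for the coherence work a real write attempt would cause.

```python
class Memory:
    """Toy LR/SC memory that counts store conditional attempts."""

    def __init__(self):
        self.data = {}
        self.reserved = {}    # core id -> reserved address
        self.sc_attempts = 0  # stand-in for attempted writes on the bus

    def load_reserved(self, core, addr):
        self.reserved[core] = addr
        return self.data.get(addr, 0)

    def store(self, core, addr, value):
        self.data[addr] = value
        for c in [c for c, a in self.reserved.items() if a == addr]:
            del self.reserved[c]

    def store_conditional(self, core, addr, value):
        self.sc_attempts += 1
        if self.reserved.get(core) != addr:
            return 0
        self.store(core, addr, value)
        return 1


def naive_try_lock(mem, core, addr):
    """Simple spin lock body: always attempts the write, then checks."""
    old = mem.load_reserved(core, addr)
    ok = mem.store_conditional(core, addr, 1)
    return ok == 1 and old == 0


def optimised_try_lock(mem, core, addr):
    """Optimised body: test first, only write if the lock looked free."""
    if mem.load_reserved(core, addr) != 0:
        return False  # lock is taken: keep spinning on the read, no write
    return mem.store_conditional(core, addr, 1) == 1


def writes_while_held(try_lock, spins):
    """Core 1 spins on a lock that core 0 already holds; count its writes."""
    mem = Memory()
    mem.store(0, "lock", 1)
    for _ in range(spins):
        assert not try_lock(mem, 1, "lock")
    return mem.sc_attempts
```

In this model `writes_while_held(naive_try_lock, 100)` is 100, while `writes_while_held(optimised_try_lock, 100)` is 0, since the optimised version only reads while the lock is held.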

Look over spin lock with local caching example

Why might we add memory barriers to spin locks?

So that nothing can be reordered against the lock, i.e. without barriers, values protected by the lock could leak into memory before all cores have realised that they are locked

Write out the instruction sequence for locking a spin lock with barrier

lock: lr r4, 0(r1)
bneq r4, lock
mov r3, #1
sc r3, 0(r1)
beqz r3, lock
membar (so no backwards propagation of values that have been changed)

Write out the instruction sequence for unlocking a spin lock with barrier

unlock: membar (make sure all operations protected by the lock are finished before we unlock it)
st zero, 0(r1)

In which 2 ways can a load reserved be implemented?

1. Bring the data into the cache in S state on the load reserved
Now the store conditional must check that the state hasn't changed and issue a BusRdX to allow modification of the data
2. Bring the data in M state on the load reserved
Now the store conditional just needs to check that the state hasn't changed, but this causes contention if 2 or more cores want the lock

Look at original example with locks

Why do we need to consider TLB coherence?

Updates (eg. from OS) to page tables can result in stale TLB entries

What are the 2 approaches to TLB coherence?

1. TLB shoot-downs: OS flushes TLB entries on every core
Relies on inter-processor interrupts to trigger updates on every core
2. Hardware keeps TLB coherent with PTEs
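
The stale-entry problem and the shoot-down approach can be illustrated with a toy model. It is only a sketch: a dictionary stands in for the page table, each core's TLB is another dictionary, and the function `shoot_down` stands in for the OS interrupting every core. All names are hypothetical.

```python
class Core:
    """A core with a private TLB that caches page table entries."""

    def __init__(self):
        self.tlb = {}
        self.walks = 0  # number of page table lookups this core has done

    def translate(self, page_table, vpn):
        if vpn not in self.tlb:  # TLB miss: look up the page table
            self.walks += 1
            self.tlb[vpn] = page_table[vpn]
        return self.tlb[vpn]


def shoot_down(cores, vpn):
    """Model of the OS making every core flush its entry for vpn."""
    for core in cores:
        core.tlb.pop(vpn, None)


def demo():
    page_table = {7: 3}  # virtual page 7 maps to frame 3
    cores = [Core(), Core()]
    before = [c.translate(page_table, 7) for c in cores]

    page_table[7] = 9    # OS remaps the page
    stale = [c.translate(page_table, 7) for c in cores]

    shoot_down(cores, 7)
    after = [c.translate(page_table, 7) for c in cores]
    return before, stale, after, [c.walks for c in cores]
```

`demo()` returns `([3, 3], [3, 3], [9, 9], [2, 2])`: after the remap both cores still translate to the old frame until the shoot-down, and afterwards every core has to look up the page table again.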

What is a disadvantage of TLB shoot-downs?

Expensive - every core gets involved, and it causes lots of page table lookups as the TLBs repopulate

What is a disadvantage of using hardware for TLB coherence?

Adds complexity to hardware

Multicore Processors (2) Flashcards

(37 cards)