Multicore Flashcards
Diagram:
Single-Core CPU Chip

Overview of
Multicore Architectures
- Replicate multiple processor cores on a single die
- The cores fit into a single processor socket
- Also called Chip Multi-Processor (CMP)
- Cores run in parallel
- Within each core, threads are time-sliced (just like on a uniprocessor)
- OS perceives each core as a separate processor
- Scheduler maps threads/processes to different cores
Multicore:
Interaction with the
Operating System
- OS perceives each core as a separate processor
- OS Scheduler maps threads/processes to different cores
- OS is likely multi-threaded itself, scheduling its own use of the cores
- Most major OSs support multicore processors today:
- Windows, Linux, Mac OS X, …
Motivation for using
Multicore Processors
- It is difficult to make single-core clock frequencies even higher
- Deeply pipelined circuits:
- heat problems
- Interconnect delays dominate
- difficult design and verification
- large design teams necessary
- Many new applications are multithreaded
- General trend in computer architecture
- Shift towards more parallelism
Instruction Level
Parallelism
- Parallelism at the machine-instruction level
- The processor can
- re-order instructions,
- pipeline instructions
- split instructions into microinstructions
- do aggressive branch prediction
- etc
- Instruction-Level parallelism enabled rapid increases in processor speeds over the last 15 years
Instruction Level
Improvements
- Architectural improvements have become small and incremental:
- Additional circuitry contributes little to application performance
- More likely additional interconnect delays will slow processor’s cycle time, reducing performance for all applications
Thread-Level Parallelism (TLP)
- Parallelism on a more coarse scale
- Server can serve each client in a separate thread
- A computer game can do AI, graphics and physics on three separate threads
- Single-Core superscalar processor cannot fully exploit TLP
- Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP
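The game example above can be sketched in code. A minimal, hypothetical Python illustration of thread-level parallelism: independent pieces of work each run in their own thread (the function name `handle_client` and the request strings are made up for illustration). Note that CPython's GIL limits CPU-bound threading, but the OS can still schedule these threads on different cores.

```python
# A minimal sketch of thread-level parallelism: each "client request"
# is handled in its own thread, as a multithreaded server might do.
# (handle_client and the request strings are illustrative only.)
from concurrent.futures import ThreadPoolExecutor

def handle_client(request):
    # Stand-in for independent per-client work (parsing, I/O, etc.)
    return request.upper()

requests = ["ping", "fetch", "quit"]
with ThreadPoolExecutor(max_workers=3) as pool:
    responses = list(pool.map(handle_client, requests))

print(responses)  # ['PING', 'FETCH', 'QUIT']
```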
Multiprocessors:
Definition
Any computer with several processors
Multiprocessors:
Types
Single Instruction Multiple Data (SIMD)
- ex: Modern Graphics Cards
Multiple Instructions, Multiple Data (MIMD)
Multiprocessors:
Memory Types
Shared Memory
In this model, there is one (large) common shared memory for all processors
Distributed Memory
In this model, each processor has its own (small) local memory.
Its contents are not replicated anywhere else.
Processors have some other communication mechanism.
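The two memory models can be contrasted with a toy pure-Python simulation (plain dicts standing in for memories, not a real multiprocessor): shared memory is one store every "processor" references, while distributed memory gives each processor a private store plus a message channel.

```python
# Toy model of the two memory organizations (simulation only).
from collections import deque

# Shared memory: both processors read/write the same store directly.
shared_mem = {}
shared_mem["x"] = 1          # written by processor 0
assert shared_mem["x"] == 1  # read directly by processor 1

# Distributed memory: separate local stores, explicit messages.
local = [{}, {}]             # each processor's private memory
channel = deque()            # the "other communication mechanism"
local[0]["x"] = 1
channel.append(("x", local[0]["x"]))  # processor 0 sends its value
key, val = channel.popleft()          # processor 1 receives it
local[1][key] = val
print(local[1]["x"])  # 1
```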
What is a “Multi-Core” Processor?
- A special kind of multiprocessor
- All processors are on the same chip
- Multicore processors are MIMD:
- Different cores execute different threads (Multiple Instructions)
- operating in different parts of memory (Multiple Data)
- Multi-core is a shared memory multiprocessor:
- All cores share the same memory
Types of Applications
that benefit from
Multi-Core Architecture
- Database Servers
- Web Servers
- Compilers
- Multimedia applications
- Scientific applications, CAD/CAM
- In general, applications with Thread-Level parallelism
Simultaneous Multithreading (SMT)
- A technique complementary to multi-core
- Addresses the problem of the processor pipeline getting stalled
- Permits multiple independent threads to execute simultaneously on the same core
- Weaving together multiple threads on the same core
- Without SMT, only a single thread can run at any given time
- Even with SMT, two threads cannot simultaneously use the same functional unit
Processor Pipeline Stall:
Two Causes
- Waiting for the result of a long floating point or integer operation
- Waiting for data to arrive from memory
- Other execution units wait unused if no SMT
Why SMT is not a “true” Parallel Processor
- Enables better utilization of a single core (performance gains of up to roughly 30%)
- OS and applications perceive each simultaneous thread as a separate “virtual processor”
- The chip has only a single copy of each resource
- Compare to multicore:
- Each core has its own copy of resources
Combining
Multi-Core and
SMT
- Cores can be SMT-enabled (or not)
- Number of SMT threads:
- 2, 4, or sometimes 8 simultaneous threads
- Intel calls them “hyperthreads”
Different Combinations:
- Single-Core, non-SMT (standard uniprocessor)
- Single-Core, SMT
- Multi-Core, non-SMT
- Multi-Core, SMT
Comparison:
Multi-Core vs SMT
Multicore:
- Several cores, each is smaller and not as powerful
- Easier to design and manufacture
- Great with thread-level parallelism
SMT:
- Can have one large and fast superscalar core
- Great performance on a single thread
- Mostly still only exploits instruction-level parallelism
Memory Hierarchy:
- SMT
- Multi-Core Chips
Simultaneous Multithreading Only:
All caches are shared
Multicore Chips:
- L1 caches are private
- L2 caches private in some architectures, shared in others
*Memory is always shared
What are “Fish” Machines?
- Dual-core Intel Xeon processors
- Each core is hyper-threaded
- Private L1 caches
- Shared L2 caches
Advantages of
Private Caches
- Closer to core, so faster access
- Reduces contention
Advantages of
Shared Caches
- Threads on different cores can share the same cache data
- More cache space available if a single (or a few) high-performance thread runs on the system
Cache Coherence Problem
Since multicore has private caches,
how to keep data consistent across caches?
- Each core should perceive memory as a shared, monolithic array
- One core copies something into its cache, makes changes, and writes back to memory
- But a second core reads the stale copy before core 1 writes back into memory
- This is a general problem with multiprocessors, not just multicore
- There are many solution algorithms and coherence protocols designed to deal with this
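The stale-read scenario above can be made concrete with a toy simulation (plain dicts standing in for memory and two private write-back caches):

```python
# Toy illustration of the cache coherence problem: core 0 writes in
# its private cache, core 1 reads its own stale cached copy before
# core 0 writes back to memory.
memory = {"x": 0}
cache = [dict(memory), dict(memory)]  # core 0 and core 1 each cache x

cache[0]["x"] = 42           # core 0 writes in its own cache
stale = cache[1]["x"]        # core 1 still sees the old value
memory["x"] = cache[0]["x"]  # core 0 writes back later -- too late

print(stale)  # 0, not 42: the two caches were incoherent
```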
Cache Coherence:
Simple Solution
Invalidation-based protocol
with snooping
Alternatively: Update protocol
Cache Coherence:
What is “snooping”?
All cores continuously “snoop”, or monitor,
the bus connecting the cores
Cache Coherence:
Invalidation Protocol
Basic Idea
If a core writes to a data item,
all other copies of this data item in other caches
become invalidated.
This is accomplished by sending an invalidation request on the bus.
Cache Coherence:
Update Protocol
Upon changing a data item,
a core broadcasts the updated value on the bus.
*Alternative to the Invalidate Protocol.
Cache Coherence:
Invalidation Protocol
vs
Update Protocol
- When performing multiple writes to the same location:
- Invalidation:
- only sends a bus message on the first write
- Update:
- must broadcast each write, including the new variable value
- Invalidation protocol generally performs better:
- generates less bus traffic
- typically requires less logic
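The bus-traffic difference can be sketched with a rough count, under the simplifying assumption that one core writes n times to a location that other caches hold a copy of:

```python
# Rough bus-message counts for n writes to one shared, cached location
# (a simplification: ignores re-reads by other cores between writes).
def invalidation_messages(n_writes):
    return 1 if n_writes > 0 else 0  # only the first write invalidates

def update_messages(n_writes):
    return n_writes                  # every write is broadcast with its value

print(invalidation_messages(10), update_messages(10))  # 1 10
```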
Cache Coherence:
Advanced Invalidation Protocols
- More sophisticated protocols use extra cache state bits
- State Bits:
- M - Modified
- E - Exclusive
- S - Shared
- I - Invalid
- Protocols can be MSI or MESI
- Note: Memory used as semaphores has special requirements
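A minimal sketch of the simpler MSI variant as a transition table (a simplification of MESI, which adds the E state; event names here are illustrative): each cache line moves between Modified, Shared, and Invalid in response to local reads/writes and snooped bus requests from other cores.

```python
# Minimal MSI state machine for one cache line (sketch, not a full
# protocol: bus messages and write-backs are noted only in comments).
MSI = {
    ("I", "local_read"):  "S",  # fetch a shared copy (BusRd)
    ("I", "local_write"): "M",  # fetch exclusive ownership (BusRdX)
    ("S", "local_read"):  "S",
    ("S", "local_write"): "M",  # invalidate other copies
    ("S", "bus_read"):    "S",
    ("S", "bus_write"):   "I",  # another core wants to write
    ("M", "local_read"):  "M",
    ("M", "local_write"): "M",
    ("M", "bus_read"):    "S",  # write back, keep a shared copy
    ("M", "bus_write"):   "I",  # write back, then invalidate
}

state = "I"
for event in ["local_read", "local_write", "bus_read", "bus_write"]:
    state = MSI[(state, event)]
print(state)  # I: read -> S, write -> M, snooped read -> S, snooped write -> I
```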
Programming for
Multi-Core
- Programmers have a choice of using
- multiple threads, or
- multiple processes
- Spread the workload across multiple cores
- Write parallel algorithms
- OS will map threads/processes to cores
- Thread safety is very important
Programming for Multicore:
Thread Safety:
Things to keep in mind
Thread Safety is VERY IMPORTANT
- Pre-emptive Context Switching:
- Context switch can happen AT ANY TIME
- Dealing with true concurrency,
- not just uniprocessor time-slicing
- Concurrency bugs are exposed much faster when dealing with multi-core
Multicore Programming:
Assigning Threads
to the Cores
- Each thread/process has an Affinity Mask
- The affinity mask specifies which cores the thread is allowed to run on
- Different threads can have different masks
- Affinities are inherited across fork()
Affinity Masks
Overview
- Affinity Masks are bit vectors that specify which cores a thread can run on
- Without SMT:
- One bit for each core: 1 if allowed, 0 if not
- When Multicore and SMT are combined:
- Affinity Mask stores separate bits for each simultaneous thread:
- 2 bits for each core
- By default, an affinity mask is all 1s, allowing a thread to run on any core
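Building such a bit vector can be sketched in a few lines of Python (the helper names are made up; on Linux the resulting mask is what a call like `sched_setaffinity` encodes):

```python
# Sketch of an affinity mask as a bit vector: bit i is 1 if the
# thread may run on logical CPU i. With SMT, each hyperthread gets
# its own bit, so a 2-way SMT core contributes 2 bits.
def make_mask(allowed_cpus, n_cpus):
    mask = 0
    for cpu in allowed_cpus:
        mask |= 1 << cpu
    return mask

def default_mask(n_cpus):
    return (1 << n_cpus) - 1  # all 1s: thread may run anywhere

m = make_mask({0, 2}, n_cpus=4)
print(bin(m), bin(default_mask(4)))  # 0b101 0b1111
```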
Affinity Masks:
Default Assignment
- By default, an affinity mask is all 1s:
- All threads can run on all processors/cores
- Then, the OS Scheduler decides which threads run on which cores
- OS Scheduler detects skewed workloads,
- migrates threads to less busy processors
- Programmers can also set their own affinities
- These are called Hard Affinities
Context Switching:
Cost
Context Switching is Costly
- Need to restart the execution pipeline
- Cached data is invalidated
- OS Scheduler tries to avoid migration as much as possible
- Tends to keep a thread on the same core
- This is called Soft Affinity
What is
Soft Affinity
The tendency of the OS Scheduler to keep a thread on the same core.
What are
Hard Affinities
Affinities that are explicitly defined by programmers.
Rule of Thumb:
Use the default scheduler unless there is a good reason not to.
When to set your own Affinities
- Two (or more) threads share data structures in memory:
- map them to the same core so they can share a cache
- Real-Time threads:
- Example:
- A thread running a robot controller
- Must not be context switched, or else the robot can become unstable
- Dedicate an entire core just to this thread