Multicore Flashcards

1
Q

Diagram:

Single-Core CPU Chip

A
2
Q

Overview of

Multicore Architectures

A
  • Replicate multiple processor cores on a single die
  • The cores fit into a single processor socket
  • Also called a Chip Multi-Processor (CMP)
  • Cores run in parallel
  • Within each core, threads are time-sliced (just like on a uniprocessor)
  • OS perceives each core as a separate processor
  • Scheduler maps threads/processes to different cores
3
Q

Multicore:

Interaction with the

Operating System

A
  • OS perceives each core as a separate processor
  • OS Scheduler maps threads/processes to different cores
  • OS is likely multi-threaded itself, scheduling its own use of the cores
  • Most major OSs support multi-cores today:
    • Windows, Linux, Mac OS X, …
4
Q

Motivation for using

Multicore Processors

A
  • It is difficult to make single-core clock frequencies even higher
  • Deeply pipelined circuits:
    • heat problems
    • interconnect delays dominate
    • difficult design and verification
    • large design teams necessary
  • Many new applications are multithreaded
  • General trend in computer architecture
    • Shift towards more parallelism
5
Q

Instruction Level

Parallelism

A
  • Parallelism at the machine-instruction level
  • The processor can
    • re-order instructions
    • pipeline instructions
    • split instructions into microinstructions
    • do aggressive branch prediction
    • etc.
  • Instruction-Level parallelism enabled rapid increases in processor speeds over the last 15 years
6
Q

Instruction Level

Improvements

A
  • Architectural improvements have become small and incremental:
    • Additional circuitry contributes little to application performance
  • More likely, additional interconnect delays will slow the processor’s cycle time, reducing performance for all applications
7
Q

Thread-Level Parallelism (TLP)

A
  • Parallelism on a coarser scale than instruction-level parallelism
  • A server can serve each client in a separate thread
  • A computer game can do AI, graphics, and physics on three separate threads
  • A single-core superscalar processor cannot fully exploit TLP
  • Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP
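The thread-per-task idea from this card can be sketched in Python (the task names mirror the game example above; the dummy workload is a stand-in for real AI/graphics/physics work). On a multicore machine, the OS scheduler is free to place these threads on different cores:

```python
import threading

results = {}

def worker(name):
    # Each task runs in its own thread; on a multicore CPU the OS
    # scheduler may place these threads on different cores.
    results[name] = sum(range(1000))  # stand-in for real work

# Illustrative workloads from the card: AI, graphics, physics
threads = [threading.Thread(target=worker, args=(n,))
           for n in ("ai", "graphics", "physics")]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results["physics"])  # 499500
```

Each thread writes a distinct key, so no lock is needed in this particular sketch.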
8
Q

Multiprocessors:

Definition

A

Any computer with several processors

9
Q

Multiprocessors:

Types

A

Single Instruction Multiple Data (SIMD)

  • ex: Modern Graphics Cards

Multiple Instructions, Multiple Data (MIMD)

10
Q

Multiprocessors:

Memory Types

A

Shared Memory

In this model, there is one (large) common shared memory for all processors.

Distributed Memory

In this model, each processor has its own (small) local memory.

Its contents are not replicated anywhere else.

Processors communicate through some other mechanism (e.g. message passing).
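The two models can be contrasted with a small Python sketch (threads stand in for processors here, purely as an analogy): the shared-memory style has all workers touch one common structure, while the distributed style keeps state local and communicates over an explicit channel:

```python
import queue
import threading

# Shared-memory style: all workers read/write one common structure.
shared = {"total": 0}
lock = threading.Lock()

def shared_worker(n):
    with lock:
        shared["total"] += n

# Distributed style: each worker keeps private local state and
# communicates results over an explicit channel instead.
channel = queue.Queue()

def distributed_worker(n):
    local = n * n        # private "local memory"
    channel.put(local)   # explicit communication mechanism

for n in (1, 2, 3):
    for target in (shared_worker, distributed_worker):
        t = threading.Thread(target=target, args=(n,))
        t.start()
        t.join()

received = [channel.get() for _ in range(3)]
print(shared["total"], sum(received))  # 6 14
```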

11
Q

What is a “Multi-Core” Processor?

A
  • A special kind of multiprocessor
    • All processors are on the same chip
  • Multicore processors are MIMD:
    • Different cores execute different threads (Multiple Instructions)
    • operating in different parts of memory (Multiple Data)
  • Multi-core is a shared memory multiprocessor:
    • All cores share the same memory
12
Q

Types of Applications

that benefit from

Multi-Core Architecture

A
  • Database Servers
  • Web Servers
  • Compilers
  • Multimedia applications
  • Scientific applications, CAD/CAM
  • In general, applications with Thread-Level parallelism
13
Q

Simultaneous Multithreading (SMT)

A
  • A technique complementary to multi-core
  • Addresses the problem of the processor pipeline getting stalled
  • Permits multiple independent threads to execute simultaneously on the same core
  • Weaving together multiple threads on the same core
  • Without SMT, only a single thread can run on the core at any given time
  • Even with SMT, two threads cannot simultaneously use the same functional unit
14
Q

Processor Pipeline Stall:

Two Causes

A
  • Waiting for the result of a long floating point or integer operation
  • Waiting for data to arrive from memory
    • Other execution units wait unused if no SMT
15
Q

Why SMT is not a “true” Parallel Processor

A
  • Enables better threading (e.g. up to a 30% performance gain)
  • OS and applications perceive each simultaneous thread as a separate “virtual processor”
  • The chip has only a single copy of each resource
  • Compare to multicore:
    • Each core has its own copy of resources
16
Q

Combining

Multi-Core and

SMT

A
  • Cores can be SMT-enabled (or not)
  • Number of SMT threads:
    • 2, 4, or sometimes 8 simultaneous threads
  • Intel calls them “hyperthreads”

Different Combinations:

  • Single-Core, non-SMT (standard uniprocessor)
  • Single-Core, SMT
  • Multi-Core, non-SMT
  • Multi-Core, SMT
17
Q

Comparison:

Multi-Core vs SMT

A

Multicore:

  • Several cores, each is smaller and not as powerful
  • Easier to design and manufacture
  • Great with thread-level parallelism

SMT:

  • Can have one large and fast superscalar core
  • Great performance on a single thread
  • Mostly still only exploits instruction-level parallelism
18
Q

Memory Hierarchy:

  • SMT
  • Multi-Core Chips
A

Simultaneous Multithreading Only:

All caches are shared

Multicore Chips:

  • L1 caches are private
  • L2 caches private in some architectures, shared in others

*Memory is always shared

19
Q

What are “Fish” Machines?

A
  • Dual-core Intel Xeon processors
  • Each core is hyper-threaded
  • Private L1 caches
  • Shared L2 caches
20
Q

Advantages of

Private Caches

A
  • Closer to core, so faster access
  • Reduces contention
21
Q

Advantages of

Shared Caches

A
  • Threads on different cores can share the same cache data
  • More cache space available if a single (or a few) high-performance thread runs on the system
22
Q

Cache Coherence Problem

A

Since multicore has private caches,

how to keep data consistent across caches?

  • Each core should perceive memory as a shared, monolithic array
  • One core copies something into its cache, makes changes, and writes back to memory
  • But a second core reads the stale copy before core 1 writes back into memory
  • This is a general problem with multiprocessors, not just multicore
  • There are many solution algorithms and coherence protocols designed to deal with this
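The failure sequence in the bullets can be traced with plain Python dictionaries standing in for one shared memory and two private caches (purely illustrative; no real coherence protocol is modeled here):

```python
# Dictionaries stand in for one shared memory and two private caches.
memory = {"x": 0}
cache0, cache1 = {}, {}

cache0["x"] = memory["x"]   # core 0 loads x into its private cache
cache0["x"] = 42            # core 0 modifies its cached copy only

cache1["x"] = memory["x"]   # core 1 reads x before the write-back...
print(cache1["x"])          # 0 -> core 1 sees a stale value

memory["x"] = cache0["x"]   # ...then core 0 finally writes back
```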
23
Q

Cache Coherence:

Simple Solution

A

Invalidation-based protocol

with snooping

Alternatively: Update protocol

24
Q

Cache Coherence:

What is “snooping”?

A

All cores continuously “snoop”, or monitor,

the bus connecting the cores

25
Q

Cache Coherence:

Invalidation Protocol

Basic Idea

A

If a core writes to a data item,

all other copies of this data item in other caches

become invalidated.

This is accomplished by sending an invalidation request on the bus.
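A toy model of the idea (class and method names invented for illustration): a Python list stands in for the snooped bus, and a write removes the line from every other attached cache:

```python
class Cache:
    """Toy write-through cache attached to a snooped 'bus' (a list)."""

    def __init__(self, bus):
        self.lines = {}
        bus.append(self)  # attach so other caches' writes reach us

    def read(self, memory, addr):
        if addr not in self.lines:            # miss: fetch from memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]

    def write(self, memory, bus, addr, value):
        for cache in bus:                     # invalidation request:
            if cache is not self:             # every snooping cache
                cache.lines.pop(addr, None)   # drops its copy
        self.lines[addr] = value
        memory[addr] = value                  # write-through, for simplicity

bus = []
memory = {"x": 0}
c0, c1 = Cache(bus), Cache(bus)

c1.read(memory, "x")            # core 1 caches x == 0
c0.write(memory, bus, "x", 42)  # the write invalidates core 1's copy
print(c1.read(memory, "x"))     # 42 -- re-fetched, not stale
```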

26
Q

Cache Coherence:

Update Protocol

A

Upon changing a data item,

a core broadcasts the updated value on the bus.

*Alternative to the Invalidate Protocol.

27
Q

Cache Coherence:

Invalidation Protocol

vs

Update Protocol

A
  • When performing multiple writes to the same location:
    • Invalidation:
      • only used on the first write
    • Update:
      • must broadcast each write, including the new variable value
  • Invalidation protocol generally performs better:
    • generates less bus traffic
    • typically requires less logic
28
Q

Cache Coherence;

Advanced Invalidation Protocols

A
  • More sophisticated protocols use extra cache state bits
  • State Bits:
    • M - Modified
    • E - Exclusive
    • S - Shared
    • I - Invalid
  • Protocols can be MSI or MESI
  • Note: Memory used as semaphores has special requirements
29
Q

Programming for

Multi-Core

A
  • Programmers have a choice of using
    • multiple threads, or
    • multiple processes
  • Spread the workload across multiple cores
  • Write parallel algorithms
  • OS will map threads/processes to cores
  • Thread safety is very important
30
Q

Programming for Multicore:

Thread Safety:

Things to keep in mind

A

Thread Safety is VERY IMPORTANT

  • Pre-emptive Context Switching:
    • Context switch can happen AT ANY TIME
  • Dealing with true concurrency,
    • not just uniprocessor time-slicing
  • Concurrency bugs are exposed much faster when dealing with multi-core
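The classic concurrency bug behind these bullets is a lost update on a shared counter. A minimal Python sketch of the thread-safe version (remove the lock and, under true multicore concurrency, the final count may come up short):

```python
import threading

counter = 0
lock = threading.Lock()

def increment(times):
    global counter
    for _ in range(times):
        with lock:        # protects the read-modify-write; a context
            counter += 1  # switch can happen AT ANY TIME without it

threads = [threading.Thread(target=increment, args=(100_000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 400000 on every run
```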
31
Q

Multicore Programming:

Assigning Threads

to the Cores

A
  • Each thread/process has an Affinity Mask
  • The affinity mask specifies which cores the thread is allowed to run on
  • Different threads can have different masks
  • Affinities are inherited across fork()
32
Q

Affinity Masks

Overview

A
  • Affinity Masks are bit vectors that specify which cores a thread can run on
    • Without SMT:
      • One bit for each core, 1 if allowed, 0 if not
  • When Multicore and SMT are combined:
    • Affinity Mask stores separate bits for each Simultaneous Thread:
      • 2 bits for each core
  • By default, an affinity mask is all 1s, allowing a thread to run on any core
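The bit-vector idea can be made concrete with a couple of helper functions (hypothetical names, for illustration only; real affinity APIs are shown later in the deck):

```python
def mask_from_cores(cores):
    """Build an affinity mask: bit i is 1 if core i is allowed."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return mask

def allowed(mask, core):
    """May a thread with this mask run on the given core?"""
    return bool(mask & (1 << core))

default_mask = mask_from_cores(range(4))  # all 1s on a 4-core chip
print(bin(default_mask))                  # 0b1111
print(allowed(default_mask, 2))           # True

pinned = mask_from_cores([0])             # allowed on core 0 only
print(allowed(pinned, 3))                 # False
```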
33
Q

Affinity Masks:

  • Default
  • Assignment
A
  • By default, an affinity mask is all 1s:
    • All threads can run on all processors/cores
  • Then, the OS Scheduler decides which threads run on which cores
  • OS Scheduler detects skewed workloads,
    • migrates threads to less busy processors
  • Programmers can also set their own affinities
    • These are called Hard Affinities
34
Q

Context Switching:

Cost

A

Context Switching is Costly

  • Need to restart the execution pipeline
  • Cached data is invalidated
  • OS Scheduler tries to avoid migration as much as possible
    • Tends to keep a thread on the same core
    • This is called Soft Affinity
35
Q

What is

Soft Affinity

A

The tendency of the OS Scheduler to keep a thread on the same core.

36
Q

What are

Hard Affinities

A

Affinities that are explicitly defined by programmers.

Rule of Thumb:

Use the default scheduler unless there is a good reason not to.

37
Q

When to set your own Affinities

A
  • Two (or more) threads share data-structures in memory:
    • map to the same core so they can share a cache
  • Real-Time threads:
    • Example:
      • A thread running a robot controller
      • Must not be context-switched, or else the robot can become unstable
      • Dedicate an entire core just to this thread
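On Linux, a hard affinity can be set from Python with `os.sched_setaffinity` (not available on all platforms; pid 0 means the calling process). A small sketch that pins the process to a single core and then restores the original mask:

```python
import os

original = os.sched_getaffinity(0)   # cores the scheduler may use now
one_core = {min(original)}           # pick a single allowed core

os.sched_setaffinity(0, one_core)    # hard affinity: this core only
print(os.sched_getaffinity(0) == one_core)  # True

os.sched_setaffinity(0, original)    # hand control back to the scheduler
```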