Lecture 7 (9/21/2009) — HW Parallelism

Last Time

  • Cache Oblivious
  • HW Parallelism Mechanisms

Parallelism

Parallelism can be exploited in hardware, or in software through parallel programming.

  • Pipelining: A form of parallelism, but not through multiple resources. It is rather a form of hiding latency by making better use of the resources we have.
  • Examples of parallelism: Execution pipelines, Memory pipelines (issue multiple requests w/o waiting for previous requests to complete), Software pipelines (used to overlap software blocks to hide latency)
  • Aspects of Parallelism:

Concurrency, Communication, Synchronization
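The latency-hiding idea behind software pipelining can be sketched in C. This is an illustrative sketch only (the function name `pipelined_sum` is invented for these notes): the load for the next iteration is issued before the add for the current one, so a machine with a memory pipeline can overlap the two.

```c
#include <stddef.h>

/* Software-pipelined sum (sketch): the load of a[i] is issued in the same
 * iteration that adds the previously loaded element, so a CPU with a
 * non-blocking memory pipeline can overlap the next load with the current
 * add. A real compiler may perform this transformation itself. */
long pipelined_sum(const long *a, size_t n)
{
    if (n == 0) return 0;
    long sum = 0;
    long cur = a[0];              /* prologue: first load          */
    for (size_t i = 1; i < n; i++) {
        long next = a[i];         /* load for iteration i overlaps */
        sum += cur;               /* the add for iteration i-1     */
        cur = next;
    }
    sum += cur;                   /* epilogue: last add            */
    return sum;
}
```

The prologue/epilogue structure is characteristic of software pipelining: the loop body always works on one loaded value while fetching the next.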

Resources in a Parallel CPU

  • Execution: ALUs, multiple processors, multiple cores
  • Control: Sequencers, Instructions, Out-of-order schedulers (which take instructions and issue them to the ALUs)
  • State: Registers, Memory
  • Networks: Interconnects (such as a bus)

Communication & Synchronization

Synchronization

  • Clock – tells you how much time has passed, but not what to do at each time step
  • Explicit signals – dependencies
  • Implicit signals – signals generated internally to maintain execution semantics, generally used for pipeline control

Communication

  • Bypass networks, Registers, Memory, Explicit signals

Key Parallelism Analysis Points

  1. Parallel Resources
  2. Synchronization
  3. Communication
  4. Shared/Partitioned resources

Kinds of Parallelism

  • ILP - Instruction Level Parallelism
  • DLP – Data-Level Parallelism
  • TLP – Task-Level Parallelism

Analysis of Parallelism for Different Systems

Superscalar (ILP for multiple ALUs) Lecture 6 slide 8

  • ILP: multiple ALUs, multiple instructions at once
  1. Parallel Resource: multiple ALUs
  2. Synchronization: Explicit signals communicating dependencies
  3. Communication: Registers (transfer data from memory to ALUs), Bypass networks, memory
  4. Shared: Sequencer, scheduler, set of registers, memory, network
  5. Partitioned: Instructions
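A minimal C sketch of the kind of code a superscalar exploits (the function name `two_chains` is invented for illustration). The scheduler's explicit dependency signals determine which of these operations may issue together.

```c
/* The four multiplies below form two independent dependency chains, so a
 * 2-wide superscalar can issue one multiply from each chain per cycle.
 * The out-of-order scheduler tracks the dependencies (explicit signals)
 * and forwards results over the bypass network rather than waiting for
 * them to reach the register file. */
long two_chains(long x, long y)
{
    long p = x * x;   /* chain 1, step 1                    */
    long q = y * y;   /* chain 2, step 1 - independent of p */
    long r = p * x;   /* chain 1, step 2 - depends only on p */
    long s = q * y;   /* chain 2, step 2 - depends only on q */
    return r + s;     /* joins the two chains                */
}
```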

SMT/TLS (ILP for multiple ALUs) Lecture 6 slide 11

  • ILP: control of ALUs is done by ILP type mechanism
  • TLP: The 2 sequencers
  1. Parallel Resources: multiple ALUs
  2. Synchronization: Explicit signals communicating dependencies
  3. Communication: Registers (transfer data from memory to ALUs), Bypass networks, memory
  4. Shared: set of registers, memory, network, ALUs
  5. Partitioned: Instructions, sequencer, arch registers

VLIW (ILP for multiple ALUs) Lecture 6 slide 15

  • ILP: control of ALUs is done by ILP type mechanism
  1. Parallel Resources: multiple ALUs
  2. Synchronization: Explicit signals communicating dependencies
  3. Communication: Registers (transfer data from memory to ALUs), Bypass networks, memory, explicit signal
  4. Shared: set of registers, memory, network
  5. Partitioned: Instructions, ALUs

Explicit Dataflow (ILP for multiple ALUs) Lecture 6 slide 18

  • ILP: control of ALUs is done by ILP type mechanism
  1. Parallel Resources: multiple ALUs
  2. Synchronization: Explicit signals communicating dependencies
  3. Communication: Registers (transfer data from memory to ALUs), explicit signals
  4. Shared: sequencer, memories, network
  5. Partitioned: Instructions, ALUs, OOO

SIMD (DLP for multiple ALUs) Lecture 6 slide 20

  • DLP: single instruction acting on multiple data
  1. Parallel Resources: multiple ALUs, multiple local memories
  2. Synchronization: clock, compiler
  3. Communication: explicit signals
  4. Shared: sequencer, instructions
  5. Partitioned: registers, memories, ALUs
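The SIMD pattern can be emulated in scalar C (real SIMD hardware would use vector registers or intrinsics; this sketch merely mimics the lanes, and `simd_add` is a made-up name):

```c
#define LANES 4

/* One "instruction" applied to LANES data elements at once, as a SIMD
 * machine would. Each lane plays the role of one ALU with its own
 * partitioned registers/local memory, while the loop body stands in for
 * the single shared instruction stream from the shared sequencer. */
void simd_add(int dst[LANES], const int a[LANES], const int b[LANES])
{
    for (int lane = 0; lane < LANES; lane++)  /* same op, every lane */
        dst[lane] = a[lane] + b[lane];
}
```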

MIMD – Shared Memory (TLP for multiple ALUs)

  • TLP: Multiple tasks running together, for example multiple sequencers, each with its own stream of instructions to keep track of.
  1. Parallel Resources: multiple ALUs, sequencers, schedulers
  2. Synchronization: explicit signals, memory
  3. Communication: memory
  4. Shared: memories, network
  5. Partitioned: sequencers, instructions, ALUs, registers, some networks
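A minimal shared-memory MIMD sketch using POSIX threads (assumption: a pthreads system; `run_two_tasks` is an invented name). Two independent instruction streams communicate through shared memory, with a mutex providing the synchronization:

```c
#include <pthread.h>
#include <stddef.h>

/* Shared-memory MIMD sketch: two tasks (threads) have partitioned
 * sequencers and registers but communicate through a shared counter in
 * memory; the mutex is the synchronization that orders their accesses. */
static long counter;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    long n = *(long *)arg;
    for (long i = 0; i < n; i++) {
        pthread_mutex_lock(&lock);   /* synchronize on shared state */
        counter++;                   /* communicate through memory  */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

long run_two_tasks(long per_task)
{
    pthread_t t1, t2;
    counter = 0;
    pthread_create(&t1, NULL, worker, &per_task);
    pthread_create(&t2, NULL, worker, &per_task);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return counter;
}
```

Without the mutex the two tasks would race on `counter`, which is exactly why shared resources are hard to scale.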

MIMD – Distributed Memory (TLP for multiple ALUs)

  • TLP: Multiple tasks running together, for example multiple sequencers, each with its own stream of instructions to keep track of.
  1. Parallel Resources: multiple ALUs, sequencers, schedulers, memory
  2. Synchronization: explicit signals
  3. Communication: explicit signals
  4. Shared: network
  5. Partitioned: sequencers, instructions, ALUs, registers, some networks, memories
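A distributed-memory sketch using POSIX `fork` with a pipe as the interconnect (assumptions: a POSIX system; a real distributed machine would use a network or a message-passing library such as MPI, and `distributed_sum` is an invented name). After `fork`, the two processes have separate address spaces, so the only communication is an explicit message:

```c
#include <unistd.h>
#include <sys/wait.h>

/* Distributed-memory MIMD sketch: parent and child have partitioned
 * memories after fork(), so the child must send its partial result as an
 * explicit message over the pipe (the shared network) - there is no
 * shared memory to read. */
long distributed_sum(long a, long b)
{
    int fd[2];
    pipe(fd);
    pid_t pid = fork();
    if (pid == 0) {                              /* child: one node       */
        long partial = a + b;                    /* compute on local data */
        write(fd[1], &partial, sizeof partial);  /* send message          */
        _exit(0);
    }
    long result = 0;                             /* parent: other node    */
    read(fd[0], &result, sizeof result);         /* receive message       */
    waitpid(pid, NULL, 0);
    close(fd[0]);
    close(fd[1]);
    return result;
}
```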

Summary of communication and synchronization

Attach:1.jpg

Summary of sharing in ILP HW

Attach:2.jpg

Summary of sharing in DLP and TLP

Attach:3.jpg

The problem with shared resources is they are hard to scale.

ILP/DLP/TLP in software

ILP, DLP and TLP are not only hardware concepts; they are also applicable to software. From the programmer’s perspective, the parallelism lives in the program’s dataflow graph. There are different kinds of parallelism inside software:

* Scheduling instructions is ILP
* Straight-line code (a sequence of expressions) is ILP
* Control constructs can be DLP/ILP
* Loops might be DLP
* Procedures are a kind of TLP (different tasks, from a pipelining perspective)

The difference between TLP and DLP in software:

* DLP comes from acting on different data.
* TLP comes from different program constructs.
* The line between the two is blurry: sometimes we can say the scope of DLP is smaller than that of TLP.
* Pipelining is definitely TLP.
* DLP exists when the same algorithm acts on different parts of the same data set.

Conversion:

* Loop unrolling is a way to convert DLP to ILP.
* Software pipelining can be considered as converting TLP/DLP to ILP.
* ILP can be converted to TLP by TLS (thread-level speculation).

In summary, DLP can be converted to TLP or ILP, and TLP can be converted to ILP. Converting TLP to DLP or ILP to DLP is possible but unnatural and not very efficient.
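The DLP-to-ILP conversion by loop unrolling can be sketched in C (the name `unrolled_sum` is invented, and the sketch assumes the element count is a multiple of 4):

```c
#include <stddef.h>

/* Loop unrolling converts DLP to ILP: the four accumulators below are
 * independent dependency chains, so a superscalar core can keep several
 * adds in flight per cycle instead of serializing on one sum variable.
 * Assumes n % 4 == 0 to keep the sketch short. */
long unrolled_sum(const long *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i];      /* four independent add chains:   */
        s1 += a[i + 1];  /* the ILP exposed here came from */
        s2 += a[i + 2];  /* the DLP of summing independent */
        s3 += a[i + 3];  /* elements of the data set       */
    }
    return s0 + s1 + s2 + s3;
}
```

A production version would add a cleanup loop for trailing elements when `n` is not a multiple of 4.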

Side Discussions

VLIW

  • VLIWs don’t have a scheduler; instructions go from the sequencer directly to the ALUs. The instruction itself describes the control of the ALUs.
  • A normal VLIW machine must use all ALUs in every instruction. If not enough independent operations can be found in the program to fill all the ALUs, no-ops are substituted.

Sequencer

  • A sequencer is generally used as the front-end portion of a pipeline. It fetches the instructions that are going to be executed and puts them into the scheduler.

Vector CPU VS. SIMD

  • The difference between the two is whether there are local memories. In a vector CPU, memory is shared, so there is no local memory. In a vector machine, one base address is needed and the data comes from consecutive addresses. All ALUs perform the same operation on different elements of this “vector” of data, with the first ALU acting on the first element.