Lecture 7 (9/21/2009) — HW Parallelism

Last Time

  • Cache Oblivious
  • HW Parallelism Mechanisms

Parallelism

Parallelism can be exploited in hardware, or in software through parallel programming.

  • Pipelining: A form of parallelism, but not through multiple resources. It is rather a form of hiding latency by making better use of the resources we have.
  • Examples of parallelism: Execution pipelines, Memory pipelines (issue multiple requests w/o waiting for previous requests to complete), Software pipelines (used to overlap software blocks to hide latency)
  • Aspects of Parallelism:

Concurrency, Communication, Synchronization
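The latency-hiding idea behind software pipelining can be sketched in C. This is an illustrative sketch only (the function name `pipelined_sum` is invented for these notes): the load for the next iteration is issued before the add for the current one, so a machine with a memory pipeline can overlap the two.

```c
#include <stddef.h>

/* Software-pipelined sum (sketch): the load of a[i] is issued in the same
 * iteration that adds the previously loaded element, so a CPU with a
 * non-blocking memory pipeline can overlap the next load with the current
 * add. A real compiler may perform this transformation itself. */
long pipelined_sum(const long *a, size_t n)
{
    if (n == 0) return 0;
    long sum = 0;
    long cur = a[0];              /* prologue: first load          */
    for (size_t i = 1; i < n; i++) {
        long next = a[i];         /* load for iteration i overlaps */
        sum += cur;               /* the add for iteration i-1     */
        cur = next;
    }
    sum += cur;                   /* epilogue: last add            */
    return sum;
}
```

The prologue/epilogue structure is characteristic of software pipelining: the loop body always works on one loaded value while fetching the next.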

Resources in a Parallel CPU

  • Execution: ALUs, multiple processors, multiple cores
  • Control: Sequencers, Instructions, Out-of-order schedulers (which take instructions and issue them to the ALUs)
  • State: Registers, Memory
  • Networks: Interconnects (such as a bus)

Communication & Synchronization

Synchronization

  • Clock – tells you how much time has passed, but not what to do at each time step
  • Explicit signals – dependencies
  • Implicit signals – signals generated internally to maintain execution semantics, generally used for pipeline control

Communication

  • Bypass networks, Registers, Memory, Explicit signals

Key Parallelism Analysis Points

  1. Parallel Resources
  2. Synchronization
  3. Communication
  4. Shared/Partitioned resources

Kinds of Parallelism

  • ILP - Instruction Level Parallelism
  • DLP – Data-Level Parallelism
  • TLP – Task-Level Parallelism

Analysis of Parallelism for Different Systems

Superscalar (ILP for multiple ALUs) Lecture 6 slide 8

  • ILP: multiple ALUs, multiple instructions at once
  1. Parallel Resource: multiple ALUs
  2. Synchronization: Explicit signals communicating dependencies
  3. Communication: Registers (transfer data from memory to ALUs), Bypass networks, memory
  4. Shared: Sequencer, scheduler, set of registers, memory, network
  5. Partitioned: Instructions
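A minimal C sketch of the kind of code a superscalar exploits (the function name `two_chains` is invented for illustration). The scheduler's explicit dependency signals determine which of these operations may issue together.

```c
/* The four multiplies below form two independent dependency chains, so a
 * 2-wide superscalar can issue one multiply from each chain per cycle.
 * The out-of-order scheduler tracks the dependencies (explicit signals)
 * and forwards results over the bypass network rather than waiting for
 * them to reach the register file. */
long two_chains(long x, long y)
{
    long p = x * x;   /* chain 1, step 1                    */
    long q = y * y;   /* chain 2, step 1 - independent of p */
    long r = p * x;   /* chain 1, step 2 - depends only on p */
    long s = q * y;   /* chain 2, step 2 - depends only on q */
    return r + s;     /* joins the two chains                */
}
```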

SMT/TLS (ILP for multiple ALUs) Lecture 6 slide 11

  • ILP: control of ALUs is done by ILP type mechanism
  • TLP: The 2 sequencers
  1. Parallel Resources: multiple ALUs
  2. Synchronization: Explicit signals communicating dependencies
  3. Communication: Registers (transfer data from memory to ALUs), Bypass networks, memory
  4. Shared: set of registers, memory, network, ALUs
  5. Partitioned: Instructions, sequencer, arch registers

VLIW (ILP for multiple ALUs) Lecture 6 slide 15

  • ILP: control of ALUs is done by ILP type mechanism
  1. Parallel Resources: multiple ALUs
  2. Synchronization: Explicit signals communicating dependencies
  3. Communication: Registers (transfer data from memory to ALUs), Bypass networks, memory, explicit signal
  4. Shared: set of registers, memory, network
  5. Partitioned: Instructions, ALUs

Explicit Dataflow (ILP for multiple ALUs) Lecture 6 slide 18

  • ILP: control of ALUs is done by ILP type mechanism
  1. Parallel Resources: multiple ALUs
  2. Synchronization: Explicit signals communicating dependencies
  3. Communication: Registers (transfer data from memory to ALUs), explicit signals
  4. Shared: sequencer, memories, network
  5. Partitioned: Instructions, ALUs, OOO

SIMD (DLP for multiple ALUs) Lecture 6 slide 20

  • DLP: single instruction acting on multiple data
  1. Parallel Resources: multiple ALUs, multiple local memories
  2. Synchronization: clock, compiler
  3. Communication: explicit signals
  4. Shared: sequencer, instructions
  5. Partitioned: registers, memories, ALUs
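The SIMD pattern can be emulated in scalar C (real SIMD hardware would use vector registers or intrinsics; this sketch merely mimics the lanes, and `simd_add` is a made-up name):

```c
#define LANES 4

/* One "instruction" applied to LANES data elements at once, as a SIMD
 * machine would. Each lane plays the role of one ALU with its own
 * partitioned registers/local memory, while the loop body stands in for
 * the single shared instruction stream from the shared sequencer. */
void simd_add(int dst[LANES], const int a[LANES], const int b[LANES])
{
    for (int lane = 0; lane < LANES; lane++)  /* same op, every lane */
        dst[lane] = a[lane] + b[lane];
}
```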

MIMD – Shared Memory (TLP for multiple ALUs)

  • TLP: Multiple tasks running together, for example multiple sequencers, each with its own stream of instructions to keep track of.
  1. Parallel Resources: multiple ALUs, sequencers, schedulers
  2. Synchronization: explicit signals, memory
  3. Communication: memory
  4. Shared: memories, network
  5. Partitioned: sequencers, instructions, ALUs, registers, some networks
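A minimal shared-memory MIMD sketch using POSIX threads (assumption: a pthreads system; `run_two_tasks` is an invented name). Two independent instruction streams communicate through shared memory, with a mutex providing the synchronization:

```c
#include <pthread.h>
#include <stddef.h>

/* Shared-memory MIMD sketch: two tasks (threads) have partitioned
 * sequencers and registers but communicate through a shared counter in
 * memory; the mutex is the synchronization that orders their accesses. */
static long counter;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    long n = *(long *)arg;
    for (long i = 0; i < n; i++) {
        pthread_mutex_lock(&lock);   /* synchronize on shared state */
        counter++;                   /* communicate through memory  */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

long run_two_tasks(long per_task)
{
    pthread_t t1, t2;
    counter = 0;
    pthread_create(&t1, NULL, worker, &per_task);
    pthread_create(&t2, NULL, worker, &per_task);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return counter;
}
```

Without the mutex the two tasks would race on `counter`, which is exactly why shared resources are hard to scale.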

MIMD – Distributed Memory (TLP for multiple ALUs)

  • TLP: Multiple tasks running together, for example multiple sequencers, each with its own stream of instructions to keep track of.
  1. Parallel Resources: multiple ALUs, sequencers, schedulers, memory
  2. Synchronization: explicit signals
  3. Communication: explicit signals
  4. Shared: network
  5. Partitioned: sequencers, instructions, ALUs, registers, some networks, memories
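A distributed-memory sketch using POSIX `fork` with a pipe as the interconnect (assumptions: a POSIX system; a real distributed machine would use a network or a message-passing library such as MPI, and `distributed_sum` is an invented name). After `fork`, the two processes have separate address spaces, so the only communication is an explicit message:

```c
#include <unistd.h>
#include <sys/wait.h>

/* Distributed-memory MIMD sketch: parent and child have partitioned
 * memories after fork(), so the child must send its partial result as an
 * explicit message over the pipe (the shared network) - there is no
 * shared memory to read. */
long distributed_sum(long a, long b)
{
    int fd[2];
    pipe(fd);
    pid_t pid = fork();
    if (pid == 0) {                              /* child: one node       */
        long partial = a + b;                    /* compute on local data */
        write(fd[1], &partial, sizeof partial);  /* send message          */
        _exit(0);
    }
    long result = 0;                             /* parent: other node    */
    read(fd[0], &result, sizeof result);         /* receive message       */
    waitpid(pid, NULL, 0);
    close(fd[0]);
    close(fd[1]);
    return result;
}
```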

Summary of communication and synchronization

Attach:1.jpg

Summary of sharing in ILP HW

Attach:2.jpg

Summary of sharing in DLP and TLP

Attach:3.jpg

The problem with shared resources is they are hard to scale.

ILP/DLP/TLP in software

ILP, DLP and TLP are not only hardware concepts; they are also applicable to software. From the programmer’s perspective, the parallelism lives in the program’s dataflow graph. There are different kinds of parallelism inside software:

* Scheduling instructions is ILP
* Straight-line code (a sequence of expressions) is ILP
* Control constructs can be DLP/ILP
* Loops might be DLP
* Procedures are a kind of TLP (different tasks, from a pipelining perspective)

The difference between TLP and DLP in software:

* DLP comes from acting on different data.
* TLP comes from different program constructs.
* The line between the two is blurry: sometimes we can say the scope of DLP is smaller than that of TLP.
* Pipelining is definitely TLP.
* DLP exists when the same algorithm acts on different parts of the same data set.

Conversion:

* Loop unrolling is a way to convert DLP to ILP.
* Software pipelining can be considered as converting TLP/DLP to ILP.
* ILP can be converted to TLP by TLS (thread-level speculation).

In summary, DLP can be converted to TLP or ILP, and TLP can be converted to ILP. Converting TLP to DLP or ILP to DLP is possible but unnatural and not very efficient.
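The DLP-to-ILP conversion by loop unrolling can be sketched in C (the name `unrolled_sum` is invented, and the sketch assumes the element count is a multiple of 4):

```c
#include <stddef.h>

/* Loop unrolling converts DLP to ILP: the four accumulators below are
 * independent dependency chains, so a superscalar core can keep several
 * adds in flight per cycle instead of serializing on one sum variable.
 * Assumes n % 4 == 0 to keep the sketch short. */
long unrolled_sum(const long *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i];      /* four independent add chains:   */
        s1 += a[i + 1];  /* the ILP exposed here came from */
        s2 += a[i + 2];  /* the DLP of summing independent */
        s3 += a[i + 3];  /* elements of the data set       */
    }
    return s0 + s1 + s2 + s3;
}
```

A production version would add a cleanup loop for trailing elements when `n` is not a multiple of 4.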

Side Discussions

VLIW

  • VLIWs don’t have a scheduler; instructions go from the sequencer directly to the ALUs. The instruction itself describes the control of the ALUs.
  • A normal VLIW machine must use all ALUs in every instruction. If not enough independent operations can be found in the program to fill all the ALUs, no-ops are substituted.

Sequencer

  • A sequencer is generally used as the front-end portion of a pipeline. It fetches the instructions that are going to be executed and puts them into the scheduler.

Vector CPU VS. SIMD

  • The difference between the two is whether there are local memories. In a vector CPU, memory is shared, so there is no local memory. In a vector machine, one base address is needed and the data comes from consecutive addresses. All ALUs perform the same operation on different elements of this “vector” of data, with the first ALU acting on the first element.