Lecture 7 (9/21/2009) — HW Parallelism
Last Time
- Cache Oblivious
- HW Parallelism Mechanisms
- Parallelism
- Resources in a Parallel CPU
- Communication & Synchronization
- Key Parallelism Analysis Points
- Kinds of Parallelism
- Analysis of Parallelism for Different Systems
- Superscalar (ILP for multiple ALUs) Lecture 6 slide 8
- SMT/TLS (ILP for multiple ALUs) Lecture 6 slide 11
- VLIW (ILP for multiple ALUs) Lecture 6 slide 15
- Explicit Dataflow (ILP for multiple ALUs) Lecture 6 slide 18
- SIMD (DLP for multiple ALUs) Lecture 6 slide 20
- MIMD Shared Memory (TLP for multiple ALUs)
- MIMD Distributed Memory (TLP for multiple ALUs)
- Summary of communication and synchronization
- Summary of sharing in ILP HW
- Summary of sharing in DLP and TLP
- ILP/DLP/TLP in software
- Side Discussions
Parallelism
Parallelism can be exploited in hardware or in software through parallel programming.
- Pipelining: a form of parallelism, but not one that uses multiple resources. It is rather a way of hiding latency by better utilizing the resources we have.
- Examples of parallelism: execution pipelines, memory pipelines (issue multiple requests without waiting for previous requests to complete), software pipelines (used to overlap software blocks to hide latency)
- Aspects of Parallelism:
Concurrency, Communication, Synchronization
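The software-pipelining idea mentioned above can be sketched in code. This is a minimal sketch in Python (used here only to show the loop structure; `load` and `compute` are made-up stand-ins for a long-latency memory read and the ALU work of one iteration):

```python
def load(data, i):       # stands in for a long-latency memory read
    return data[i]

def compute(x):          # stands in for the ALU work of one iteration
    return x * x

def pipelined(data):
    """Issue the load for iteration i+1 before computing iteration i,
    so that on real hardware the memory latency overlaps the compute."""
    out = []
    cur = load(data, 0)                 # prologue: first load
    for i in range(len(data) - 1):
        nxt = load(data, i + 1)         # issue next load early
        out.append(compute(cur))        # compute overlaps the load
        cur = nxt
    out.append(compute(cur))            # epilogue: last compute
    return out
```

In sequential Python the two stages still run one after the other; the point is the restructured loop body, which gives hardware the chance to overlap them.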
Resources in a Parallel CPU
- Execution: ALUs, multiple processors, multiple cores
- Control: sequencers, instructions, out-of-order schedulers (which take instructions and issue them to the ALUs)
- State: registers, memory
- Networks: interconnects (such as a bus)
Communication & Synchronization
Synchronization
- Clock: tells you how much time has passed, but does not tell you what to do at each time step
- Explicit signals: communicate dependencies
- Implicit signals: generated internally to maintain execution semantics; generally used for pipeline control
Communication
- Bypass networks, Registers, Memory, Explicit signals
Key Parallelism Analysis Points
- Parallel Resources
- Synchronization
- Communication
- Shared/Partitioned resources
Kinds of Parallelism
- ILP - Instruction Level Parallelism
- DLP Data-Level Parallelism
- TLP Task-Level Parallelism
Analysis of Parallelism for Different Systems
Superscalar (ILP for multiple ALUs) Lecture 6 slide 8
- ILP: multiple ALUs, multiple instructions at once
- Parallel Resource: multiple ALUs
- Synchronization: Explicit signals communicating dependencies
- Communication: Registers (transfer data from memory to ALUs), Bypass networks, memory
- Shared: Sequencer, scheduler, set of registers, memory, network
- Partitioned: Instructions
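A toy model of what the shared out-of-order scheduler does: pack instructions with no unresolved dependences into the same cycle, up to the issue width (the number of ALUs). This is only a sketch, not real hardware; the instruction names and the `schedule` helper are invented for illustration, and cyclic dependences are assumed not to occur:

```python
def schedule(deps, n_alus):
    """deps: {instr: set of instrs it depends on} (assumed acyclic).
    Returns a list of cycles, each cycle a list of issued instrs."""
    done, cycles = set(), []
    while len(done) < len(deps):
        # an instruction is ready once all its producers have executed
        ready = [i for i in deps if i not in done and deps[i] <= done]
        issue = ready[:n_alus]          # issue width = number of ALUs
        cycles.append(issue)
        done.update(issue)
    return cycles
```

For example, with `c` depending on independent `a` and `b`, a 2-ALU machine issues `a` and `b` together in cycle 1 and `c` in cycle 2, while a 1-ALU machine needs three cycles.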
SMT/TLS (ILP for multiple ALUs) Lecture 6 slide 11
- ILP: control of ALUs is done by ILP type mechanism
- TLP: The 2 sequencers
- Parallel Resources: multiple ALUs
- Synchronization: Explicit signals communicating dependencies
- Communication: Registers (transfer data from memory to ALUs), Bypass networks, memory
- Shared: set of registers, memory, network, ALUs
- Partitioned: Instructions, sequencer, arch registers
VLIW (ILP for multiple ALUs) Lecture 6 slide 15
- ILP: control of ALUs is done by ILP type mechanism
- Parallel Resources: multiple ALUs
- Synchronization: Explicit signals communicating dependencies
- Communication: Registers (transfer data from memory to ALUs), Bypass networks, memory, explicit signal
- Shared: set of registers, memory, network,
- Partitioned: Instructions, ALUs
Explicit Dataflow (ILP for multiple ALUs) Lecture 6 slide 18
- ILP: control of ALUs is done by ILP type mechanism
- Parallel Resources: multiple ALUs
- Synchronization: Explicit signals communicating dependencies
- Communication: Registers (transfer data from memory to ALUs), explicit signals
- Shared: sequencer, memories, network
- Partitioned: Instructions, ALUs, OOO
SIMD (DLP for multiple ALUs) Lecture 6 slide 20
- DLP: single instruction acting on multiple data
- Parallel Resources: multiple ALUs, multiple local memories
- Synchronization: clock, compiler
- Communication: explicit signals
- Shared: sequencer, instructions
- Partitioned: registers, memories, ALUs
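A minimal sketch of the SIMD idea: one operation applied elementwise across all lanes at once. (Real hardware does this with a single instruction over parallel ALUs; `simd_add` is an invented name standing in for such an instruction.)

```python
def simd_add(va, vb):
    """One 'instruction', many data: every lane (list slot) gets the
    same add operation, applied to its own partitioned operands."""
    return [a + b for a, b in zip(va, vb)]
```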
MIMD Shared Memory (TLP for multiple ALUs)
- TLP: Multiple tasks running together, for example having multiple sequencers each of which has its own task of instructions to keep track of.
- DLP: can also be exploited by running the same task over different pieces of data
- Parallel Resources: multiple ALUs, sequencers, schedulers
- Synchronization: explicit signals, memory
- Communication: memory
- Shared: memories, network
- Partitioned: sequencers, instructions, ALUs, registers, some networks
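A sketch of the shared-memory style: two tasks share one memory (the `counts` dict) and synchronize with a lock. CPython threads do not actually execute in parallel, so this only illustrates the structure of communication through shared memory; all names are made up:

```python
import threading

counts = {"total": 0}          # shared memory
lock = threading.Lock()        # synchronization

def task(n):
    for _ in range(n):
        with lock:             # communicate/synchronize through memory
            counts["total"] += 1

threads = [threading.Thread(target=task, args=(1000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the lock, the two read-modify-write sequences on `counts["total"]` could interleave and lose updates; the lock is the explicit synchronization layered on top of the shared memory.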
MIMD Distributed Memory (TLP for multiple ALUs)
- TLP: Multiple tasks running together, for example having multiple sequencers each of which has its own task of instructions to keep track of.
- DLP: can also be exploited by running the same task over different pieces of data
- Parallel Resources: multiple ALUs, sequencers, schedulers, memories
- Synchronization: explicit signals
- Communication: explicit signals
- Shared: network
- Partitioned: sequencers, instructions, ALUs, registers, some networks, memories
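A sketch of the distributed-memory style: each task keeps its own private state and communicates only by explicit messages. Here a `queue.Queue` stands in for the network, and the producer/consumer names are invented:

```python
import queue
import threading

def producer(chan, items):
    local = [x * x for x in items]   # private memory, private work
    for v in local:
        chan.put(v)                  # explicit communication over the network
    chan.put(None)                   # explicit end-of-stream signal

def consumer(chan, results):
    while True:
        v = chan.get()
        if v is None:                # synchronization via the signal itself
            break
        results.append(v)

chan = queue.Queue()
results = []
t1 = threading.Thread(target=producer, args=(chan, [1, 2, 3]))
t2 = threading.Thread(target=consumer, args=(chan, results))
t1.start(); t2.start(); t1.join(); t2.join()
```

Note that, unlike the shared-memory sketch, the tasks never touch each other's data: all communication and synchronization ride on the explicit messages.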
Summary of communication and synchronization
Summary of sharing in ILP HW
Summary of sharing in DLP and TLP
The problem with shared resources is that they are hard to scale.
ILP/DLP/TLP in software
ILP, DLP, and TLP are not only hardware concepts; they apply to software as well. From the programmer's perspective, the parallelism lives in the dataflow graph. Different kinds of parallelism appear in software:
* Instruction scheduling is ILP
* Straight-line code (a sequence of expressions) is ILP
* Control constructs are DLP/ILP
* Loops might be DLP
* Procedures are a kind of TLP (as different tasks, from a pipelining perspective)
The difference between TLP and DLP in software:
* DLP comes from acting on different data.
* TLP comes from different program constructs.
* The line between the two is ambiguous: sometimes the best we can say is that the scope of DLP is smaller than that of TLP.
* Pipelining is definitely TLP.
* DLP exists when the same algorithm is applied across different parts of a data set.
- Conversion
* Loop unrolling is a way to convert DLP to ILP.
* Software pipelining can be viewed as converting TLP/DLP to ILP.
* ILP can be converted to TLP by TLS (thread-level speculation).
In summary, DLP can be converted to TLP/ILP, and TLP can be converted to ILP. Converting TLP to DLP or ILP to DLP is possible but unnatural and not very efficient.
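Loop unrolling, the first conversion above, can be sketched as follows: the DLP of "same operation on every element" is rewritten as several independent operations per loop iteration, which a superscalar (or a VLIW compiler) can overlap as ILP. The function name and the multiple-of-4 restriction are just for the sketch:

```python
def unrolled_sum_sq(a):
    """Sum of squares, unrolled by 4: the four accumulators carry no
    dependences between them, exposing four-wide ILP per iteration."""
    assert len(a) % 4 == 0            # simplifying assumption, no epilogue
    s0 = s1 = s2 = s3 = 0
    for i in range(0, len(a), 4):
        # four independent operations hardware can issue together
        s0 += a[i] * a[i]
        s1 += a[i + 1] * a[i + 1]
        s2 += a[i + 2] * a[i + 2]
        s3 += a[i + 3] * a[i + 3]
    return s0 + s1 + s2 + s3
```

A single accumulator would serialize every addition behind the previous one; splitting it into four breaks that dependence chain.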
Side Discussions
VLIW
- VLIWs don't have a scheduler; instructions go from the sequencer directly to the ALUs. The instruction itself describes the control of the ALUs.
- Normal VLIW machines must use all ALUs in each instruction. If not enough independent operations can be found in the program to utilize all the ALUs, no-ops are substituted.
Sequencer
- A sequencer is generally used as the front-end portion of a pipeline. It takes in the instruction that is going to be executed and puts it into the scheduler.
Vector CPU vs. SIMD
- The difference between the two is whether or not we have local memories. In a vector CPU, memory is shared, so there is no local memory. In a vector machine, one address is needed and the data comes from consecutive addresses. All ALUs perform the same operation on different elements of this "vector" of data, with the first ALU acting on the first element.
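The vector idea can be sketched like this: one "instruction" names a base address and a length, the hardware fetches data from consecutive addresses in the shared memory, and every lane applies the same operation. This is a hypothetical helper, not any real ISA:

```python
def vector_scale(memory, base, length, factor):
    """One 'vector instruction': fetch `length` consecutive elements
    starting at `base` from shared memory, apply the same operation
    (scale by `factor`) in every lane, first lane on the first element."""
    return [memory[base + i] * factor for i in range(length)]
```

The contrast with the SIMD sketch earlier is that operands come from one shared memory via an address and a stride, not from per-lane local memories.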