





- $R_i$  : set of memory locations read (input) by task  $T_i$
- $W_j$ : set of memory locations written (output) by task  $T_j$
- Two tasks  $T_1$  and  $T_2$  are parallel if
  - input to  $T_1$  is not part of output from  $T_2$
  - input to  $\underline{T}_2$  is not part of output from  $\underline{T}_1$
  - outputs from  ${\color{black}T_1}$  and  ${\color{black}T_2}$  do not overlap

EE382V: Prinicples in Computer Architecture, Fall 2008 -- Lecture 8 ric Rabbah, IBM (c) Rodric Rabbah, 2007 and Mattan Erez, 2008



| onstruction                                      |
|--------------------------------------------------|
| Structures                                       |
| ation<br>ns<br>lechanisms usec<br>allel programs |
|                                                  |

# Algorithm Structure Design Space

- Given a collection of concurrent tasks, what's the next step?
- Map tasks to units of execution (e.g., threads)
- Important considerations
  - Magnitude of number of execution units platform will support
  - Cost of sharing information among execution units
  - Avoid tendency to over constrain the implementationWork well on the intended platform
    - Flexible enough to easily adapt to different architectures

EE382V: Prinicples in Computer Architecture, Fall 2008 -- Lecture 8 (c) Podric Pablab, 2007 and Mattan Frez. 2008

## Major Organizing Principle

- How to determine the algorithm structure that represents the mapping of tasks to units of execution?
- Concurrency usually implies major organizing principle

EE382V: Prinicples in Computer Architecture, Fall 2008 -- Lecture 8 (c) Podric Pabliab, 2007 and Mattan Fraz, 2008

- Organize by tasks
- Organize by data decomposition
- Organize by flow of data

Organize by Tasks?

2





EE382V: Prinicples in Computer Architecture, Fall 2008 -- Lecture 8 (c) Rodric Rabbah. 2007 and Mattan Erez 2008







## Pipeline Throughput vs. Latency

- Amount of concurrency in a pipeline is limited by
  the number of stages
- Works best if the time to fill and drain the pipeline is small compared to overall running time
- Performance metric is usually the throughput
   Rate at which data appear at the end of the pipeline per time unit (e.g., frames per second)
- Pipeline latency is important for real-time applications
  - Time interval from data input to pipeline, to data output

EE382V: Prinicples in Computer Architecture, Fall 2008 -- Lecture 8 Rabbah. IBM (c) Rodric Rabbah. 2007 and Mattan Erez. 2008

# Event-Based Coordination

- In this pattern, interaction of tasks to process data can vary over unpredictable intervals
- Deadlocks are a danger for applications that use this pattern
  - Dynamic scheduling has overhead and may be inefficient
     Granularity a major concern

EE382V: Prinicples in Computer Architecture, Fall 2008 -- Lecture 8 Rabbah. IBM (c) Rodric Rabbah. 2007 and Mattan Erez. 2008



# Code Supporting Structures

EE382V: Prinicples in Computer Architecture, Fall 2008 -- Lecture 8

- Loop parallelism
- Master/Worker
- Fork/Join
- SPMD
- Map/Reduce





# Master/Worker Pattern

- Particularly relevant for problems using task parallelism pattern where task have no dependencies
  - Embarrassingly parallel problems
- Main challenge in determining when the entire problem is complete

EE382V: Prinicples in Computer Architecture, Fail 2008 -- Lecture 8 (c) Packic Pabhab 2007 and Matter Frez 2008

#### Fork/Join Pattern

- Tasks are created dynamically
   Tasks can create more tasks
- Manages tasks according to their relationship
- Parent task creates new tasks (fork) then waits until they complete (join) before continuing on with the computation

EE382V: Prinicples in Computer Architecture, Fall 2008 -- Lecture 8 (c) Rodric Rabbah. 2007 and Mattan Erez. 2008

# SPMD Pattern

 Single Program Multiple Data: create a single source-code image that runs on each processor
 - Initialize

> EE382V: Prinicples in Computer Architecture, Fall 2008 -- Lecture 8 (c) Rodric Pabhab. 2007 and Mattan Frez. 2008

- Obtain a unique identifier
- Run the same program each processor
  Identifier and input data differentiate behavior
- Distribute data
- Finalize

#### SPMD Challenges

- Split data correctly
- Correctly combine the results
- Achieve an even distribution of the work
- For programs that need dynamic load balancing, an alternative pattern is more suitable

EE382V: Prinicples in Computer Architecture, Fall 2008 -- Lecture 8

Map/Reduce Pattern
Two phases in the program
Map phase applies a single function to all

data

- Each result is a tuple of value and tag
- Reduce phase combines the results
  - The values of elements with the same tag are combined to a single value per tag -- *reduction*
  - Semantics of combining function are associative
- Can be done in parallel
- Can be pipelined with map
- Google uses this for all their parallel programs

EE382V: Prinicples in Computer Architecture, Fall 2008 -- Lecture 8 dric Rabbah, IBM (c) Rodric Rabbah, 2007 and Mattan Erez, 2008

### Communication and Synchronization Patterns

#### Communication

- Point-to-point
- Broadcast
- Reduction
- Multicast
- Synchronization
- Locks (mutual exclusion)
- Monitors (events)
- Barriers (wait for all)
  - Split-phase barriers (separate signal and wait)
     Sometimes called "fuzzy barriers"
  - Named barriers allow waiting on subset

EE382V: Prinicples in Computer Architecture, Fall 2008 -- Lecture 8 Rabbah, IBM (c) Rodric Rabbah, 2007 and Mattan Erez, 2008

| (from the Book)     |                     |                          |                         |                   |          |                             |  |
|---------------------|---------------------|--------------------------|-------------------------|-------------------|----------|-----------------------------|--|
|                     | Task<br>parallelism | Divide<br>and<br>conquer | Geometric decomposition | Recursive<br>data | Pipeline | Event-based<br>coordination |  |
| SPMD                | ****                | ***                      | ****                    | **                | ***      | **                          |  |
| Loop<br>Parallelism | ****                | **                       | ***                     |                   |          |                             |  |
| Master/<br>Worker   | ****                | **                       | *                       | *                 | ****     | *                           |  |
| Fork/<br>Join       | **                  | ****                     | **                      |                   | ****     | ****                        |  |

 Patterns can be hierarchically composed so that a program uses more than one pattern

EE382V: Prinicples in Computer Architecture, Fall 2008 -- Lecture 8 C Rabbah. IBM (c) Rodric Rabbah. 2007 and Mattan Erez. 2008

|                     | Task<br>parallelism | Divide<br>and<br>conquer | Geometric<br>decomposition | Recursive<br>data | Pipeline | Event-base<br>coordinatio |
|---------------------|---------------------|--------------------------|----------------------------|-------------------|----------|---------------------------|
| SPMD                |                     |                          |                            |                   |          |                           |
| Loop<br>Parallelism |                     |                          |                            |                   |          |                           |
| Master/<br>Worker   |                     |                          |                            |                   |          |                           |
| Fork/<br>Join       |                     |                          |                            |                   |          |                           |

|                     | Task<br>parallelism             | Divide<br>and<br>conquer | Geometric<br>decomposition | Recursive<br>data | Pipeline                     | Event-based<br>coordination |
|---------------------|---------------------------------|--------------------------|----------------------------|-------------------|------------------------------|-----------------------------|
| SPMD                | ****                            | **                       | ****                       | **                | ****                         | *                           |
| Loop<br>Parallelism | ****<br>when no<br>dependencies | *                        | ****                       | *                 | ****<br>SWP to hide<br>comm. |                             |
| Master/<br>Worker   | ****                            | ***                      | ***                        | ***               | **                           | ****                        |
| Fork/<br>Join       | ****                            | ****                     | **                         | ****              |                              | *                           |

EE382V: Prinlcples in Computer Architecture, Fall 2008 -- Lecture 8 Rabbah, IBM (c) Rodric Rabbah, 2007 and Mattan Erez. 2008