Lecture 8 Notes

Parallelism in Software

Parallelism exists in software inherently to some degree; it may come in the form of ILP, DLP, TLP, or a combination of the three. The important point is to figure out how to extract and exploit this inherent parallelism. In general, ILP, DLP, and TLP can be found in the following areas (illustrated with a short code sketch after the list):

  • ILP (Mostly taken care of by the compiler):
    • Straight-line code with no branches or dependencies
    • Code in which conditionals and control signals can be pre-determined by the compiler
  • DLP:
    • Loops with no dependencies across iterations
  • TLP:
    • Independent procedures
    • Independent tasks
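
As a minimal sketch (an example of our own, not from the lecture), the loop below applies the same operation to different data elements with no dependence across iterations (DLP), while the two threads at the bottom run independent tasks (TLP):

    #include <cstddef>
    #include <thread>
    #include <vector>

    // DLP: the same operation applied to different data elements, with no
    // dependence between iterations.
    void scale(std::vector<float>& v, float k) {
        for (std::size_t i = 0; i < v.size(); ++i)
            v[i] *= k;
    }

    // TLP: two independent pieces of work, run as separate threads.
    void run_tasks() {
        std::thread t1([] { /* e.g., update simulation state */ });
        std::thread t2([] { /* e.g., write results to a log  */ });
        t1.join();
        t2.join();
    }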

There is no clear-cut way to determine which type of parallelism exists in a piece of code; in particular, the DLP/TLP boundary is very vague. For the purposes of this class, we will use the following definitions to differentiate between DLP and TLP from a software perspective:

  • DLP:
    • DLP acts on different data elements or sets.
  • TLP:
    • TLP comes from different program constructs.
    • Data is potentially shared between different tasks.
    • Pipelining also exploits TLP.

Hardware has the final say on how to extract parallelism from the software. Furthermore, it can convert one form of parallelism into another:

  • Easy path: DLP → TLP → ILP
  • Hard & inefficient path: ILP → TLP → DLP (Reason: it requires analysis and speculation. Predication is a tool used for this type of conversion, at the expense of efficiency; see the sketch after this list.)
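
A minimal sketch of predication via if-conversion (our own example): the branch is replaced by a select, turning the control dependence into a data dependence so the loop body becomes straight-line code that can be scheduled or vectorized more freely. Note that work is done for both arms of the original branch, which is the efficiency cost mentioned above.

    // Branchy form: the control dependence limits scheduling and vectorization.
    void clamp_branchy(const int* a, int* b, int n) {
        for (int i = 0; i < n; ++i) {
            if (a[i] > 0)
                b[i] = a[i] * 2;
            else
                b[i] = 0;
        }
    }

    // Predicated (if-converted) form: both candidate values are computed and a
    // select picks one, so the body is straight-line code at the cost of doing
    // the work of the untaken path as well.
    void clamp_predicated(const int* a, int* b, int n) {
        for (int i = 0; i < n; ++i) {
            int doubled = a[i] * 2;
            b[i] = (a[i] > 0) ? doubled : 0;
        }
    }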

There are many examples of conversion in software and hardware:

  • Software:
    • Loop unrolling (DLP → ILP; see the sketch after this list)
    • Software pipelining (DLP/TLP → ILP)
    • Creating threads (DLP → TLP)
  • Hardware:
    • SMT - Simultaneous Multi-Threading (DLP/TLP → ILP)
    • TLS - Thread-Level Speculation (ILP → TLP)
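
As an illustrative sketch (our own example) of the loop-unrolling conversion listed above: the original loop's independent iterations (DLP) become independent statements within a single iteration, exposing ILP to the instruction scheduler.

    // Original loop: each iteration is independent (DLP).
    void add_simple(const float* a, const float* b, float* c, int n) {
        for (int i = 0; i < n; ++i)
            c[i] = a[i] + b[i];
    }

    // Unrolled by 4: the four statements in the body have no dependences on one
    // another, exposing ILP to a wide-issue scheduler. (Assumes n is a multiple
    // of 4; remainder handling omitted for brevity.)
    void add_unrolled(const float* a, const float* b, float* c, int n) {
        for (int i = 0; i < n; i += 4) {
            c[i]     = a[i]     + b[i];
            c[i + 1] = a[i + 1] + b[i + 1];
            c[i + 2] = a[i + 2] + b[i + 2];
            c[i + 3] = a[i + 3] + b[i + 3];
        }
    }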

Parallel Programming

Starting from Scratch

Here are the steps a programmer usually takes when beginning to write parallel code:

  1. Start with an algorithm
  2. Determine the parallelism in the algorithm and minimize synchronization
  3. Consider and take advantage of locality, as communication is costly

Alternatively, one may start with existing sequential code and reengineer it to extract parallelism.

Reengineering an Existing Program for Parallelism

In reengineering for parallelism, there are four common steps:

  1. Partitioning - Decomposition
  2. Partitioning - Assignment
  3. Partitioning - Orchestration
  4. Mapping

Reengineering usually starts with sequential code. The constructs and patterns in the sequential code should then be examined to find which ones can be parallelized. Expectations for the modified code should be defined before the programmer begins to make modifications, as a feasibility check, and a testing protocol also needs to be laid out beforehand. Finally, the parts of the code that consume most of the execution time should be given higher priority in parallelization; profiling tools come in very handy for this.

Parallelizing a Program

Decomposition

In the decomposition stage, the programmer needs to take the following steps to break the code down into parts that can run in parallel:

  1. Identify concurrency and ways to exploit it.
  2. Break up computation into tasks, then tasks into processes. One should not forget that the availability and number of tasks will most likely vary over time.
  3. Make sure to have enough tasks to keep processors busy.

The amount of speedup that can be achieved by decomposition is determined by coverage and Amdahl’s Law, which states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used. As a result, the benefit of parallelization grows as a larger fraction of the execution can be done in parallel.

 Amdahl’s Law:
 p = Fraction of work that can be parallelized
 n = The number of processors

             Old running time            1
 Speedup = -------------------- = ---------------
             New running time       (1-p)+(p/n)
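
For illustration (numbers chosen here, not from the lecture): with p = 0.9 and n = 8, the speedup is 1 / (0.1 + 0.9/8) = 1 / 0.2125 ≈ 4.7. Even with an unlimited number of processors, the speedup is bounded by 1 / (1 - p) = 10.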

Although Amdahl’s Law does not allow for super-linear speedup, it is achievable in practice because of the additional registers and cache capacity the extra processors make available. Furthermore, the new arrangement of the code in parallel may take better advantage of locality.

Assignment

Assignment is the task of grouping the decomposed processes into blocks at some coarse level of granularity so that they can be shipped off to a processor. We need to specify mechanisms for dividing the work among processing elements and make sure the workload is well balanced. Structured approaches usually work well in balancing the load. Partitioning data to take advantage of locality is another important aspect of assignment.

                                                     Fine-Grain     Coarse-Grain
                                                     Parallelism    Parallelism
  Computation/communication ratio                    Low            High
  Computational work between communication stages    Small          Large
  Communication/synchronization overhead             High           Low
  Load-balance efficiency                            High           Low

In general, when synchronization is expensive, coarse granularity is preferred, whereas when there are few units of execution with large disparities in execution time, fine granularity is desired. A short sketch contrasting the two follows.
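
A minimal sketch (our own example; the names and structure are assumptions, not from the lecture) of the same loop assigned to threads at two granularities:

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <thread>
    #include <vector>

    void work(std::vector<double>& a, std::size_t i) { a[i] = std::sqrt(a[i]); }

    // Coarse-grain: each thread gets one large contiguous block, so overhead is
    // low and locality is good, but uneven block costs can leave threads idle.
    void coarse(std::vector<double>& a, unsigned nthreads) {
        std::vector<std::thread> ts;
        std::size_t chunk = (a.size() + nthreads - 1) / nthreads;
        for (unsigned t = 0; t < nthreads; ++t)
            ts.emplace_back([&, t] {
                std::size_t lo = t * chunk, hi = std::min(a.size(), lo + chunk);
                for (std::size_t i = lo; i < hi; ++i) work(a, i);
            });
        for (auto& th : ts) th.join();
    }

    // Fine-grain (cyclic): iterations are interleaved, which balances uneven
    // per-iteration costs at the price of poorer locality; schemes that pull
    // small tasks from a shared queue pay additional synchronization overhead.
    void fine(std::vector<double>& a, unsigned nthreads) {
        std::vector<std::thread> ts;
        for (unsigned t = 0; t < nthreads; ++t)
            ts.emplace_back([&, t] {
                for (std::size_t i = t; i < a.size(); i += nthreads) work(a, i);
            });
        for (auto& th : ts) th.join();
    }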

Orchestration & Mapping

In this step, we need to consider the following points to ensure we get the most out of our parallelization process:

  • Computation and communication should be overlapped and handled concurrently (see the sketch after this list).
  • Locality of data should be preserved.
  • Scheduling should be done in such a way that dependencies are satisfied as early as possible.
  • The target system should be surveyed if possible to take advantage of all available mechanisms.
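
A minimal sketch (our own example) of the first point, overlapping communication with computation: the next chunk of data is fetched asynchronously while the current chunk is being processed.

    #include <future>
    #include <vector>

    std::vector<int> fetch_chunk(int id) {           // stands in for a costly transfer
        return std::vector<int>(1024, id);
    }

    long long process(const std::vector<int>& chunk) {
        long long s = 0;
        for (int x : chunk) s += x;
        return s;
    }

    long long run(int nchunks) {
        long long total = 0;
        auto next = std::async(std::launch::async, fetch_chunk, 0);
        for (int i = 0; i < nchunks; ++i) {
            std::vector<int> cur = next.get();       // wait for the chunk in flight
            if (i + 1 < nchunks)                     // start fetching the next chunk
                next = std::async(std::launch::async, fetch_chunk, i + 1);
            total += process(cur);                   // compute while the fetch proceeds
        }
        return total;
    }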


Patterns for Parallel Programming

Parallel programming by patterns helps parallel programmers in the following ways:

  • Provides a systematic guide
  • Provides a common vocabulary
  • Helps with software reusability, malleability, and modularity

Patterns for Decomposition

There are three common patterns for decomposition:

  • Task decomposition:
Task decomposition usually exists naturally within a program. It is better to start with many tasks than with a few, as this gives us flexibility. However, this finer granularity should not hinder efficiency. Furthermore, in finding and creating new tasks, the simplicity of the code should be maintained.
  • Pipeline task decomposition:
When tasks are complex, they often contain a pipeline, which increases throughput at the expense of increased latency.
  • Data decomposition:
Data decomposition looks at the problem from a perspective in which data partitioning is given priority over task partitioning. Tasks can then be assigned to each partition of the data. The major advantage of data decomposition is that data parallelism is usually much more scalable than task parallelism. Just as in task decomposition, flexibility, efficiency, and simplicity are the three major points to consider in data decomposition. Two common data decomposition patterns are geometric data structures (decomposition of arrays along rows, columns, or blocks; decomposition of meshes into domains) and recursive data structures (decomposition of trees into sub-trees); a short sketch of the recursive pattern follows.
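
A minimal sketch (our own example) of the recursive pattern: a tree is decomposed into sub-trees, and each sub-tree near the root becomes an independent task.

    #include <future>
    #include <memory>

    struct Node {
        int value;
        std::unique_ptr<Node> left, right;
    };

    long long tree_sum(const Node* n, int depth) {
        if (!n) return 0;
        // Spawn sub-tree tasks only near the top of the tree so task-creation
        // overhead does not swamp the useful work (the efficiency concern above).
        if (depth > 0) {
            auto l = std::async(std::launch::async, tree_sum, n->left.get(),  depth - 1);
            auto r = std::async(std::launch::async, tree_sum, n->right.get(), depth - 1);
            return n->value + l.get() + r.get();
        }
        return n->value + tree_sum(n->left.get(), 0) + tree_sum(n->right.get(), 0);
    }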