Mini Lab 3 - Xeon Phi and OpenMP

Overview

In this lab, you will write a parallel exclusive scan with OpenMP that runs on Xeon Phi. The goal is to try many different optimizations and beat the execution time of the sequential version. Many useful tutorials are available on OpenMP and on using it with Xeon Phi. Some materials can be found here and here.

Instructions

  • Download the starter package from Canvas, copy it to the Stampede machine, and untar starter_lab3.tgz. In this lab, we will use the ‘normal-mic’ queue. The serial execution time is printed for comparison. Assume that the input size is a power of two.
  • icc will be used to compile.
  • To build, go to the scan_starter directory and run ./build.sh
  • To execute, run ./run.sh
  • You will mainly modify the phi_exclusive_scan() function in main.cpp, but you are welcome to modify other parts of the file if necessary. The file contains a commented-out #pragma statement; this statement is required to offload work to Xeon Phi. Please refer to the example shown here to find out how to offload work to MIC.
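
As a rough illustration of the offload step mentioned in the last bullet, the sketch below wraps a loop in the Intel compiler's #pragma offload directive. The function offload_double_example and its parameters are hypothetical and are not part of the starter package; the pragma in main.cpp may use different clauses. The in and out clauses with length(n) state how many elements are copied to and from the coprocessor. A compiler without offload support ignores the pragma, so the loop then runs on the host.

```cpp
// Hypothetical helper (not from the starter package): writes 2 * input[i]
// into output[i], running the loop on the coprocessor when the compiler
// supports Intel offload pragmas.
void offload_double_example(int* input, int* output, int n)
{
    // in(...)  : copy n ints from the host to the MIC before the block runs.
    // out(...) : copy n ints from the MIC back to the host afterwards.
    #pragma offload target(mic) in(input : length(n)) out(output : length(n))
    {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            output[i] = 2 * input[i];
        }
    }
}
```

Calling offload_double_example(in, out, 4) with in = {1, 2, 3, 4} fills out with {2, 4, 6, 8} whether or not the offload actually takes place.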

The following “C-like” code is an iterative version of exclusive scan. In it, parallel_for marks potentially parallel loops; it is not an OpenMP keyword, and in OpenMP such a loop is written as an ordinary for loop preceded by #pragma omp parallel for. You are welcome to use the following code as a starting point.

    void exclusive_scan_iterative(int* start, int* end, int* output)
    {
        int N = end - start;
        memmove(output, start, N*sizeof(int));

        // upsweep phase.
        for (int twod = 1; twod < N; twod*=2)
        {
            int twod1 = twod*2;
            parallel_for (int i = 0; i < N; i += twod1)
            {
                output[i+twod1-1] += output[i+twod-1];
            }
        }

        output[N-1] = 0;

        // downsweep phase.
        for (int twod = N/2; twod >= 1; twod /= 2)
        {
            int twod1 = twod*2;
            parallel_for (int i = 0; i < N; i += twod1)
            {
                int t = output[i+twod-1];
                output[i+twod-1] = output[i+twod1-1];
                output[i+twod1-1] += t; // change twod1 to twod to reverse prefix sum.
            }
        }
    }
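
One possible way to turn the listing above into compilable OpenMP code is sketched below: each parallel_for becomes a for loop with #pragma omp parallel for. The function name exclusive_scan_omp is an assumption made for this example, and the sketch is only a direct translation, not a tuned solution. It relies on the same assumption as the lab, namely that N is a power of two.

```cpp
#include <cstring>

// Sketch: direct OpenMP translation of the iterative exclusive scan.
// Assumes the input length N is a power of two.
void exclusive_scan_omp(const int* start, const int* end, int* output)
{
    const int N = static_cast<int>(end - start);
    if (N <= 0) return;
    std::memmove(output, start, N * sizeof(int));

    // Upsweep phase: build partial sums in place, one tree level at a time.
    for (int twod = 1; twod < N; twod *= 2) {
        const int twod1 = twod * 2;
        #pragma omp parallel for
        for (int i = 0; i < N; i += twod1) {
            output[i + twod1 - 1] += output[i + twod - 1];
        }
    }

    // The last element becomes the identity before the downsweep.
    output[N - 1] = 0;

    // Downsweep phase: push prefix sums back down the tree.
    for (int twod = N / 2; twod >= 1; twod /= 2) {
        const int twod1 = twod * 2;
        #pragma omp parallel for
        for (int i = 0; i < N; i += twod1) {
            const int t = output[i + twod - 1];
            output[i + twod - 1] = output[i + twod1 - 1];
            output[i + twod1 - 1] += t;
        }
    }
}
```

For the input {3, 1, 7, 0, 4, 1, 6, 3} this produces {0, 3, 4, 11, 11, 15, 16, 22}. Note that the levels near the top of the tree have very few iterations, so starting a parallel loop for each of them may cost more than it saves; that is one of the things worth measuring.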

Tips

  • Since Xeon Phi has 512-bit-wide SIMD units, you should take a closer look at whether each loop was vectorized for MIC.
    • For the icc compiler, the ‘-vec-report2’ option prints whether each loop was vectorized.
  • The compiler may refuse to vectorize a loop when it cannot prove that there is no loop-carried dependency, even if no such dependency exists. In such cases, you can use pragmas (for example, #pragma ivdep or #pragma simd) to tell the compiler that vectorization is safe.
  • Think about cache on Xeon Phi and memory bandwidth.
  • You are allowed to use Xeon Phi intrinsics directly.
  • Because Xeon Phi cores execute instructions in order, prefetching can have a greater impact here than on out-of-order cores.
  • OpenMP directives offer several choices for how the work is divided among threads and how it is scheduled.
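
Two of the tips above, the scheduling choices and the vectorization hints, can be illustrated on a small unit-stride loop. This is a minimal sketch, assuming a compiler with OpenMP 4.0 support; the function add_offset is hypothetical and only shows where the clauses go. Whether these hints help in your scan should be checked against the vectorization report and the measured MIC time.

```cpp
// Hypothetical example: adds `offset` to every element of `data`.
void add_offset(int* data, int n, int offset)
{
    // schedule(static) divides the iterations into roughly equal contiguous
    // chunks across the threads; simd asks the compiler to vectorize the
    // iterations each thread executes.
    #pragma omp parallel for simd schedule(static)
    for (int i = 0; i < n; ++i) {
        data[i] += offset;
    }
}
```

Other schedule kinds, such as schedule(dynamic) or schedule(guided), trade lower load imbalance for more scheduling overhead, so the best choice depends on how uniform the iterations are.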

Checking the execution time

Unlike lab 2, where we could measure the time directly right before and after the kernel launch, the MIC execution time alone cannot be measured with that approach. However, we can obtain it by setting an environment variable. Before executing the binary, type ‘export OFFLOAD_REPORT=2’ or put that line into the sbatch file. After execution, check the file in the log directory. The following is an example of the output found there.

 [Offload] [MIC 0] [File]            main.cpp
 [Offload] [MIC 0] [Line]            80
 [Offload] [MIC 0] [Tag]             Tag2
 [Offload] [MIC 0] [CPU Time]        0.000000 (seconds)
 [Offload] [MIC 0] [CPU→MIC Data]   67108868 (bytes)
 [Offload] [MIC 0] [MIC Time]        1.038118 (seconds)
 [Offload] [MIC 0] [MIC→CPU Data]   67108864 (bytes)

The value on the [MIC Time] line is the one you should report.
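
If you collect many runs, pulling the MIC Time value out of each log by hand gets tedious. The sketch below shows one way to extract it from a single report line in the format shown above. The helper parse_mic_time is hypothetical and assumes the line contains the literal tag [MIC Time] followed by the time in seconds.

```cpp
#include <cstdio>
#include <cstring>

// Hypothetical helper: returns the seconds value from a "[MIC Time]" report
// line, or -1.0 if the line is not a MIC Time line or has no number.
double parse_mic_time(const char* line)
{
    const char* tag_text = "[MIC Time]";
    const char* tag = std::strstr(line, tag_text);
    if (!tag) return -1.0;

    double seconds = 0.0;
    // Skip past the tag, then read the first floating-point number;
    // %lf skips the padding spaces in front of it.
    if (std::sscanf(tag + std::strlen(tag_text), "%lf", &seconds) != 1) {
        return -1.0;
    }
    return seconds;
}
```

Applied to each line of a log file, the helper returns a negative value for every line except the MIC Time line, so the best time over several runs is the minimum of the non-negative results.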

Submission Guide

  • Write a report that covers the following:
    • Include both partners’ names and UT EIDs at the top of your write-up.
    • Replicate the score table generated for your solution.
      • Additionally, report the best MIC times for four inputs (“1”, “32768”, “16777216”, “67108864”).
    • Briefly describe how you arrived at your final solution. What other approaches did you try along the way? What was wrong with them?
    • How much memory and compute bandwidth does your solution consume? How does that compare to the theoretical maximum bandwidth of Xeon Phi? Is your solution memory bound or compute bound?
  • Compress your starter directory. The file name should be lab3.tgz
  • Submit the compressed file and your write-up on Canvas.