Lab 1 — locality in a CPU

Out: 8/24/2016   Due: 9/2/2016, 9/9/2016, 9/19/2016

In this lab we will use dense matrix-matrix multiplication to study the effects of locality on performance and power dissipation. Matrix multiplication is deceptively simple and actually offers many opportunities for optimization. We will also learn about three tools that can help with architecture optimizations (performance counters, CACTI, and PIN).

You can use just about any computer to run this lab; instructions are provided for Linux machines. I recommend using your own machine for the timing part, and if you feel you need some help, use TACC. When you compile your code, please use reasonable compiler optimization options. I suggest sticking to gcc and using -O2; if you want to experiment with other options see https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html. Don’t forget to report which flags you use, and remember that you learn a lot more by optimizing yourself when possible. You may also find it instructive to experiment with other compilers (llvm or, if using TACC machines, icc). When comparing optimizations, take a look at the assembly that is actually generated, especially whether any SIMD instructions are being emitted.

The lab has three parts with different due dates to “encourage” you to procrastinate less. From prior experience this lab takes time, so you should start early. Read the whole thing first, and please don’t hesitate to come to office hours to get help and complain about anything that might be wrong (which you can also do on Piazza).

Emphasis and grading

This lab is about locality and matrix multiplication is used as a teaching example. The optimization possibilities for this application are almost endless, and it is not the intent that you try to explore them all. Rather, try to focus on those that are directly related to the teaching/learning goals of this class. Basically, effort placed in optimizations that are not locality optimizations will not be rewarded (in grading) as much as ideas related to locality, methodology, or tools introduced in this lab.

Grading will address all the specific questions that appear below as well as any additional comments or ideas you share.

We will grade using a 5-level scale (I expect the final class grades to be in the A to B range, with more ‘A’s and ‘A-’s than ‘B’s):

5: Truly remarkable work
4: Exceeded expectations (specifically with regard to learning goals)
3: Met expectations
2: Did not meet all learning goals expected
1: Requires significant changes

Matrix multiplication

The following C code is the entire matrix multiplication program! We will only use the simple N^3 algorithm in this lab.

  1. #include <stdlib.h>
  2.  
  3. // define the matrix dimensions A is MxP, B is PxN, and C is MxN
  4. #define M 512
  5. #define N 512
  6. #define P 512
  7.  
  8. // calculate C = AxB
  9. void matmul(float **A, float **B, float **C) {
  10.   float sum;
  11.   int   i;
  12.   int   j;
  13.   int   k;
  14.  
  15.   for (i=0; i<M; i++) {
  16.     // for each row of C
  17.     for (j=0; j<N; j++) {
  18.       // for each column of C
  19.       sum = 0.0f; // temporary value
  20.       for (k=0; k<P; k++) {
  21.         // dot product of row from A and column from B
  22.         sum += A[i][k]*B[k][j];
  23.       }
  24.       C[i][j] = sum;
  25.     }
  26.   }
  27. }
  28.  
  29. // function to allocate a matrix on the heap
  30. // creates an mXn matrix and returns the pointer.
  31. //
  32. // the matrices are in row-major order.
  33. void create_matrix(float*** A, int m, int n) {
  34.   float **T = 0;
  35.   int i;
  36.  
  37.   T = (float**)malloc( m*sizeof(float*));
  38.   for ( i=0; i<m; i++ ) {
  39.      T[i] = (float*)malloc(n*sizeof(float));
  40.   }
  41.   *A = T;
  42. }
  43.  
  44. int main() {
  45.   float** A;
  46.   float** B;
  47.   float** C;
  48.  
  49.   create_matrix(&A, M, P);
  50.   create_matrix(&B, P, N);
  51.   create_matrix(&C, M, N);
  52.  
  53.   // assume some initialization of A and B
  54.   // think of this as a library where A and B are
  55.   // inputs in row-major format, and C is an output
  56.   // in row-major.
  57.  
  58.   matmul(A, B, C);
  59.  
  60.   return (0);
  61. }
Baseline matrix-matrix multiplication.

Part 1 — Analytical Modeling (Due 9/2/2016)

In the next two parts we will look at 2 of the 4 most common ways of doing architecture research: measurements on real machines, where we have limited or no control over parameters, and simulation, which can be slow and does not scale well. In this part we’ll look at analytical modeling, which can be very useful for evaluating trends and studying scaling beyond the capabilities of simulators. More importantly, it lets you roughly verify that what you are measuring makes sense, which is very important. The fourth common technique is prototyping, by the way.

Question 1

Derive formulae for the locality of the original (triply-nested ijk loop), cache-aware (single-level tiled), and cache-oblivious implementations for one level of locality. The formulae should represent what fraction of accesses (access = operand read or write) are serviced by a local store of a given capacity. Assume the matrices are much larger than the storage capacity. You can think of this store as an ideal cache (i.e., LRU replacement, fully associative, with one word per cache line).

Draw a graph of the locality of each of the three implementations as the amount of storage grows.
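As a starting point for the counting (a sketch of the framework only; h(S) is a symbol introduced here, not something derived for you), the total number of operand accesses follows directly from the baseline code:

```latex
% Each inner-loop iteration reads A[i][k] and B[k][j]; the running sum stays
% in a register, and each (i,j) pair writes C[i][j] once:
\mathrm{reads} = 2MNP, \qquad \mathrm{writes} = MN,
\qquad \mathrm{total\ accesses} = 2MNP + MN
% If h(S) of these accesses hit in a store of capacity S words, the
% locality fraction asked for is
f(S) = \frac{h(S)}{2MNP + MN}
```

Deriving h(S) for each of the three implementations is the substance of the question.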

If you get to this part before we cover this in class, please look at the following presentations from UIUC and Bryant and O’Hallaron for the cache-aware case, and the papers linked here.

Question 2

Can you extend your model to a storage hierarchy (e.g., registers, L1, L2, memory)?

Part 2 — Optimizing Performance (Due 9/9/2016)

In this part of the lab we will try to improve the performance of the baseline version above on a uniprocessor. Most of the performance to be gained comes from locality optimizations, followed by converting the code to use modern processors’ short-vector SIMD units. We will focus on the locality part.

In order to improve performance we will need a good way to define and measure performance. We will use ‘GFLOPS’ (giga-FLOPS, billions of floating-point operations per second) as our measure of performance. To calculate the GFLOPS of our application we will need to measure the number of operations executed as well as the time spent in the calculation.

Measuring the number of operations in this case is very simple. All the computation occurs in the loop nest starting on line 15, with one addition and one multiplication in the inner-most loop, so the total number of operations is 2 × M × N × P. In general it is important to count the actual number of operations required by the algorithm when measuring performance and make sure not to include operations that were added as part of coding or optimization.

Measuring time is a bit trickier, because we cannot simply use the built-in OS time facilities due to accuracy issues. To allow more accurate measurements, processors provide hardware performance counters that count various events, such as cycles, retired instructions, branches, and cache accesses. We will use the performance counters to measure the number of cycles required by our application.

There are many measurement packages that ease the use of hardware performance counters. One of the best and best-supported is PAPI, developed at the University of Tennessee, Knoxville. Unfortunately, using PAPI requires a kernel patch to Linux, so we will use the perf tool instead. A tutorial and examples for perf usage are available at https://perf.wiki.kernel.org/index.php/Tutorial. We also give some simple usage examples in the steps below, but I suggest reading the tutorial to figure out what is possible with perf. The lab will give you a taste of how to measure performance, and the questions will point out some potential pitfalls. If you have access to PAPI and know, or want to learn, how to use it, feel free to do so.

  If you are particularly interested in the topic of measurement, I strongly suggest Whaley, R.C. and Castaldo, A.M., “Achieving accurate and context-sensitive timing for code optimization” ((URL)) (BibTeX) as a very practical and to-the-point paper.

Step 1


Install the perf package by running apt-get install linux-tools-common and apt-get install linux-tools if you’re using Ubuntu. If you’re using Fedora, yum install perf will work.

As a cycle-accurate timing harness, we will use Agner Fog’s test programs. Download and unzip http://www.agner.org/optimize/testp.zip, then unzip TSCUni.zip and read TSCUni.txt. Usage is very simple: insert the routine you want to time into TSCTest.cpp at the location indicated in that file.

The general form for measuring user-level events with perf is (assuming your executable binary is named matmul):

perf stat -e [event,...] COMMAND [ARGS]

(Example usage) perf stat -e cycles:u,instructions:u,cache-references:u,cache-misses:u ./matmul

(Some useful events)
 cycles:u - total cycles (be careful about what happens under CPU frequency scaling when relating cycles to time)
 instructions:u - retired instructions
 cache-references:u - usually LLC accesses; depending on your CPU this may include prefetches and coherence messages
 cache-misses:u - usually LLC misses
 L1-dcache-loads:u - L1 data-cache load accesses
 L1-dcache-load-misses:u - L1 data-cache load misses
 L1-dcache-stores:u - L1 data-cache store accesses
 L1-dcache-store-misses:u - L1 data-cache store misses

The :u suffix restricts counting to user-level instructions (the default is to count at both kernel and user level).

You can monitor several events in a single perf command, and some events (several are listed above) are useful for understanding and explaining the differences between measurements and between code versions.

The command perf list shows the events that can be counted on your system (note, though, that some of the listed events may not actually be available on your CPU). You can find more details on each event in the perf_event_open() documentation, but note that the exact meaning of each event may vary depending on your CPU.

When you increase the number of monitored events, however, you need to be careful, because the number of hardware performance counters is limited. According to the perf wiki, when monitoring more events than there are available hardware counter registers, events are multiplexed over the registers, which may decrease accuracy. For example, the Intel i5-2520M 2.5GHz CPU has 11 hardware performance counter registers in total. I recommend using fewer than 5 events simultaneously; if you use too many events in a single perf command, some events will not generate useful numbers (often registering zero).

Due to dynamic frequency scaling, care must be taken when converting a cycle count to execution time. One option is to fix the CPU frequency with cpufrequtils, but we will not do that in this lab.

Note that perf measures your whole program, which runs not only matmul() but also create_matrix() and other functions, so the estimated GFLOPS may be a bit off; it is still useful for evaluating your software under different configurations and optimization techniques. Please measure the performance of your application at least 20 times. When using the timing harness, the #defines and the matmul function itself should go toward the top of TSCTest.cpp, and only the body of main() (without the return statement) needs to be placed inside the test code; make sure to time only the matrix multiplication itself and not the matrix-creation routines. Notice that when you run the timing harness you get several measurement results, all reported in cycles.

Question 3

How many GFLOPS did the basic version achieve on your test machine? Please use three different matrix sizes: M=N=P=32, M=N=P=512, and M=N=P=2048.

Things to address: What machine did you use? Why are several measurements taken? Which one(s) should you report (please see the cache performance as well)? How did you convert from cycles to seconds and why did you do it that way?

Was there any significant difference among the matrix sizes in the trends of the timings and cache misses across measurements? Can you make them behave the same way with regard to timing variance? Should you? Please include any code you used.


Step 2


Now try to improve locality, and hence performance. Please think about optimizing for registers as well as for the memory hierarchy. Try both cache-aware and cache-oblivious methods. You may want to consider larger datasets for this step of the lab to test your ideas.
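A minimal cache-aware (tiled) sketch, not the definitive implementation: matmul_tiled and the tile size T are names introduced here. T would be tuned so a few T×T tiles fit in the target cache, and the code assumes M, N, and P are multiples of T (true for the sizes used in this lab).

```c
#include <stdlib.h>

// Dimensions as in the baseline listing.
#define M 512
#define N 512
#define P 512
#define T 64   // tile size: tune so a few TxT tiles fit in the target cache

// Cache-aware (tiled) variant; same interface as the baseline matmul.
void matmul_tiled(float **A, float **B, float **C) {
  int i, j, k, ii, jj, kk;

  // C is accumulated into, so clear it first
  for (i = 0; i < M; i++)
    for (j = 0; j < N; j++)
      C[i][j] = 0.0f;

  for (ii = 0; ii < M; ii += T)
    for (kk = 0; kk < P; kk += T)
      for (jj = 0; jj < N; jj += T)
        // multiply the TxT tile of A at (ii,kk) into the tile of C at (ii,jj)
        for (i = ii; i < ii + T; i++)
          for (k = kk; k < kk + T; k++) {
            float a = A[i][k];            // kept in a register across the j loop
            for (j = jj; j < jj + T; j++)
              C[i][j] += a * B[k][j];     // unit stride in both B and C
          }
}
```

Note the i,k,j order inside the tile: with row-major storage it makes the innermost loop unit-stride in both B and C.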

The A, B, and C matrix formats are all row-major (even though the rows are just malloced individually for simplicity; think of this as a library interface). Anything you do to manipulate A and B should be counted toward time and memory accesses (but not toward the operation count used to compute performance).
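For the cache-oblivious method, one common outline is recursive divide-and-conquer under the same row-major interface (a sketch only; matmul_rec, the offsets, and CUTOFF are names introduced here, and the recursion never needs to know the cache sizes):

```c
#include <stdlib.h>

// Cache-oblivious sketch: recursively halve the largest of (m, n, p) until
// the subproblem is small, then run the simple loops. (ri, rj, rk) locate
// the subproblem inside the full matrices; the caller must zero C because
// results are accumulated (C += A*B).
#define CUTOFF 32   // hypothetical base-case size

void matmul_rec(float **A, float **B, float **C,
                int ri, int rj, int rk, int m, int n, int p) {
  int i, j, k;

  if (m <= CUTOFF && n <= CUTOFF && p <= CUTOFF) {
    for (i = 0; i < m; i++)
      for (k = 0; k < p; k++) {
        float a = A[ri + i][rk + k];
        for (j = 0; j < n; j++)
          C[ri + i][rj + j] += a * B[rk + k][rj + j];
      }
  } else if (m >= n && m >= p) {       // split the rows of A and C
    matmul_rec(A, B, C, ri, rj, rk, m / 2, n, p);
    matmul_rec(A, B, C, ri + m / 2, rj, rk, m - m / 2, n, p);
  } else if (n >= p) {                 // split the columns of B and C
    matmul_rec(A, B, C, ri, rj, rk, m, n / 2, p);
    matmul_rec(A, B, C, ri, rj + n / 2, rk, m, n - n / 2, p);
  } else {                             // split the shared dimension
    matmul_rec(A, B, C, ri, rj, rk, m, n, p / 2);
    matmul_rec(A, B, C, ri, rj, rk + p / 2, m, n, p - p / 2);
  }
}
```

Because every split halves the working set, some recursion level fits in each cache of the hierarchy without the code being told any capacity.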

Question 4

Which techniques worked well, and which didn’t make much of a difference? Why? Please discuss how your optimizations change the number of cache misses at each level, and how that affects overall performance. Please address the different locality mechanisms in hardware when answering this question, and document both failed and successful techniques (only correct ones, of course).

Part 3 — Estimating Power (Due 9/19/2016)

As architects it is important for us to understand both the software costs (performance) and the hardware costs. Hardware costs are typically categorized as die area, complexity and power/energy. There are other costs to consider such as reliability, features, and so forth that are usually set by the usage model. We will only look at power/energy in this lab.

We discussed the effects locality has on power and energy in class. We will not attempt to develop a power model from “first principles” and will instead use the well-regarded CACTI tool from HP Labs. CACTI models the hardware costs of caches, SRAMs, and DRAM. The latest version of CACTI (v6.5) can be downloaded at http://www.hpl.hp.com/research/cacti/cacti65.tgz. CACTI also has an easy-to-use web interface at http://quid.hpl.hp.com:9081/cacti/, but the web interface consistently runs an out-of-date version and is not as configurable.

To estimate the power and energy consumed by the matrix multiplication application we need to know both how much energy each access to a locality level takes and how many accesses are actually made to each level. To determine this we could use the hardware performance counters to measure cache accesses, but then we would not be able to explore the behavior as we change cache sizes. Therefore, we will use a simple cache simulator built on top of PIN, a dynamic binary instrumentation tool.


Step 3


Download Pin from http://www.pintool.org/downloads.html and untar it (you can use /tmp if you don’t have much quota left). The source/tools/SimpleExamples directory contains dcache.cpp, which implements a one-level data cache. We want a 2-level cache model, which you can easily create by some artful copying and pasting. This is a good opportunity to learn how to modify existing tools to achieve your goals. Make sure you have some way of changing the cache parameters and sizes (you can just replicate the parameters of the L1 cache). For the cache model, please assume both caches are read- and write-allocate, write-back, and inclusive (this means: allocate a line on any read or write miss, allocate in L2 on an L1 miss, and assume that a write-back from L1 never misses in L2 and does not change its LRU state). Note that the PIN tool will not work on some Linux distributions; the symptom is the error message “Can not load shared library libm.so.6”. Refer to the release notes for tested configurations.
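The bookkeeping described above can be sketched in plain C (an illustration of the policy only, not dcache.cpp itself; Cache, cache_new, cache_access, and mem_access are names introduced here, write-back traffic is not modeled, and the touch_lru flag only shows the hook for keeping L1 write-backs from disturbing L2’s LRU):

```c
#include <stdlib.h>

// Tag-only model of one set-associative LRU cache level; no data are stored.
typedef struct {
  int sets, ways;
  long block;            // block size in bytes
  unsigned long *tag;    // sets*ways tags (0 = invalid)
  unsigned long *lru;    // larger = more recently used
  long hits, misses;
  unsigned long tick;
} Cache;

Cache *cache_new(long size, int ways, long block) {
  Cache *c = calloc(1, sizeof *c);
  c->sets = (int)(size / (ways * block));
  c->ways = ways;
  c->block = block;
  c->tag = calloc((size_t)c->sets * ways, sizeof *c->tag);
  c->lru = calloc((size_t)c->sets * ways, sizeof *c->lru);
  return c;
}

// Look up addr; on a miss, allocate with LRU replacement (write-allocate).
// touch_lru=0 is the hook for an L1 write-back that must not disturb L2's LRU.
int cache_access(Cache *c, unsigned long addr, int touch_lru) {
  unsigned long blk = addr / (unsigned long)c->block;
  int set = (int)(blk % (unsigned long)c->sets);
  unsigned long tag = blk / (unsigned long)c->sets + 1;  // +1 so 0 = invalid
  unsigned long *tags = c->tag + (size_t)set * c->ways;
  unsigned long *lrus = c->lru + (size_t)set * c->ways;
  int w, victim = 0;
  for (w = 0; w < c->ways; w++)
    if (tags[w] == tag) {
      if (touch_lru) lrus[w] = ++c->tick;
      c->hits++;
      return 1;
    }
  c->misses++;
  for (w = 1; w < c->ways; w++)          // evict the least recently used way
    if (lrus[w] < lrus[victim]) victim = w;
  tags[victim] = tag;
  lrus[victim] = ++c->tick;
  return 0;
}

// One program load or store: try L1; on an L1 miss, access (and, inside
// cache_access, allocate in) L2 -- this is what makes the hierarchy inclusive.
void mem_access(Cache *l1, Cache *l2, unsigned long addr) {
  if (!cache_access(l1, addr, 1))
    cache_access(l2, addr, 1);
}
```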

Once you have the two-level model, verify it by running the following steps from within the source/tools/SimpleExamples directory. You will have to change obj-ia32 to obj-ia64 or obj-intel64 depending on the architecture of the CPU you are using. Set the L1 parameters to (size=32KB, associativity=4, block size=32) and the L2 to (size=2048KB, associativity=16, block size=64).

make
wget "http://lph.ece.utexas.edu/class/Sp15EE382N/Lab1?action=download&upname=test_cache.c.txt"
mv test_cache.c.txt test_cache.c
wget http://lph.ece.utexas.edu/downloads/EE382V_Principles/test_cache.out
gcc -O0 test_cache.c -o test_cache
../../../pin -t obj-ia32/dcache.so -- ./test_cache
diff dcache.out test_cache.out

If there are small differences in the numbers ignore them, but if not, one of us made a mistake.

You’re now ready to measure the locality in your matrix multiplication code. Just run your PIN dcache tool as you did for test_cache, but on your matrix multiplication program instead.

Question 5

Please fill in the table below with the number of references made to each locality level. For registers, just count the accesses required for the actual computation, which you can calculate from the code. Please use the exact implementations you developed in the previous steps. For all the runs, set the L1 associativity to 4 and the L2 associativity to 16. Some of these runs may take an hour or so.

To find out the cache parameters of the processor you are using just search for the specific model number on the web. Your processor model appears in Control Panel/System on Windows machines and in /proc/cpuinfo (text file) on Linux machines.

N=M=P | L1 Size | L1 Block | L2 Size | L2 Block | Original (Regs/L1/L2/MEM) | Cache Aware (Regs/L1/L2/MEM) | Cache Oblivious (Regs/L1/L2/MEM)
2048 | your processor params | | | | | |
2048 | 32K | 64 | 1024K | 128 | | |
512 | 8K | 32 | 256K | 64 | | |
512 | 1K | 16 | 128K | 64 | | |

Why do you think the results differ between the configurations and between the cache-aware and cache-oblivious implementations? Can you make the cache-aware and cache-oblivious versions behave similarly? If yes, please fill in another table for the new version(s).


Step 4


In this step we will use CACTI to estimate the potential power savings of locality optimizations. Use CACTI v6.5 to fill in the table below.


For registers, copy the configuration file cache.cfg to a new file called register.cfg, and edit it to represent a “cache” with 128 entries of 32 bits each (this is not accurate, but gives a good sense of things). Set ‘-cache type’ to ‘ram’ with one input and one output port, at 350K and 1 bank (also choose ITRS-HP, Conservative, and semi-global). Use 45 as the technology, which is similar to the 45nm technology in use today.

For caches, use cache.cfg, and configure it appropriately. Set ‘-cache type’ to ‘cache’ with 1 read-write port at 350K and 1 bank (also choose ITRS-HP, Conservative, and semi-global). Use 45 as the technology, which is similar to the 45nm technology in use today.

For DRAM, use dram.cfg, and set size to 512MB, one read/write port, commodity DRAM rather than ITRS-HP, number of bits read 128, and global wires.

Level | Size | Ways | Block | Access Energy
Your L1 | | | |
Your L2 | | | |
Registers | 512 | 1 | 4 |
L1 | 32K | 4 | 64 |
L1 | 8K | 4 | 32 |
L1 | 1K | 2 | 16 |
L2 | 1024K | 16 | 128 |
L2 | 256K | 16 | 64 |
L2 | 128K | 16 | 64 |
DRAM | 512MB | 1 | 16 |

Question 6

Now combine steps 3 and 4 and calculate the energy savings in the table below. Fill in the cache-aware and cache-oblivious columns as fractions of the original. What conclusions can you draw?

N=M=P | L1 Size | L1 Block | L2 Size | L2 Block | Original Energy | Cache Aware / Original | Cache Oblivious / Original
2048 | your processor params | | | | | |
2048 | 32K | 64 | 1024K | 128 | | |
512 | 8K | 32 | 256K | 64 | | |
512 | 1K | 16 | 128K | 64 | | |

Question 7

What is the reduction in power consumption (think about this carefully)?