Algorithms to utilize Locality

Matrix Multiplication

for i = 0 to N
  for j = 0 to M
    C[i][j]=0
    for k = 0 to P
      C[i][j] += A[i][k] * B[k][j]

Assumptions:

  • Matrices are stored in row-major format.
  • Matrices are so large that a whole row cannot reside in the cache at one time.

Conclusion:

  • Spatial locality in A, as elements are accessed row-wise.
  • Temporal locality in C, as all the computation for a particular element of C is done at one time.
  • Spatial locality in C, as elements are accessed row-wise.
  • No locality in B: it is accessed column-wise, and the matrices are too large for a row of the cache's contents to survive until its next use.

Best algorithms to exploit locality:

  • Stride-1 / “streaming” algorithms
    • Stride - The distance, in terms of memory addresses, between consecutive reads.

To improve the locality of B in the above algorithm, B could be stored in column-major order.

  • This increases the cost of the operation
    • A transpose would have to be done before the multiplication.

Optimization: Loop Interchange

for i = 0 to N
  for k = 0 to P
    for j = 0 to M
      C[i][j] += A[i][k] * B[k][j]

Conclusion:

  • No temporal locality in C: each element is now updated once per k iteration rather than finished in one pass, so C must be zeroed before the loops.
  • The temporal locality of C has moved to A, since A[i][k] is invariant in the inner loop.
  • Spatial locality in B, as the inner loop walks a row of B.
  • Spatial locality in A.

Optimization: Blocking

for i = 0 to N, i++
  for j = 0 to M, j = j + Bsz
    for k = 0 to P, k++
      for jj = j to min(j + Bsz, M), jj++
        C[i][jj] += A[i][k] * B[k][jj]

The above loop tiles the j dimension into blocks of size Bsz (written Bsz here to avoid a clash with the matrix B), so a Bsz-wide panel of columns of B and C is worked on at a time.

  • This restores some temporal locality in C if the block size Bsz is small enough for the panel to stay in cache.
  • Blocking can be done at all levels of the cache hierarchy and for both rows and columns.
  • For blocking both rows and columns, three blocked loops are set up with indexes ii, jj, and kk. This corresponds to one level of blocking, i.e., blocking for the L1 cache.
  • For L2 blocking, iii, jjj, and kkk indexes are used for both row and column blocking.

Optimization: Scalar Replacement

  • Also called Array Promotion
  • Uses registers so that multiple LOADs and STOREs can be collapsed into one LOAD and one STORE.
  • The temporal locality in matrix C can be exploited by keeping the intermediate value of an element of C in a register.
for i = 0 to N, i++
  for j = 0 to M, j++
    C' = 0
    for k = 0 to P, k++
      C' += A[i][k] * B[k][j]
    C[i][j] = C'
  • Another kind of optimization that can be done is vector register optimization, i.e., using the machine's built-in SIMD vector registers for the computation.
  • In the LOOP-INTERCHANGE optimization, A[i][k] can be placed in a register since it exhibits temporal locality.
  • In the BLOCKING optimization, loop unrolling can be done:
    • Block size < number of registers
    • Interchange and unroll = Unroll and Jam

Register Optimizations

  • Registers are closest to the processor in the memory hierarchy.
  • As seen in scalar replacement, keeping values in registers reduces the number of loads and stores required.
  • Better register allocation with fewer spills
    • The compiler keeps values in registers; if it runs out of registers for a new value, it stores an old value to the stack so the register can be reused. This is called a “spill”. A “fill” is the reload of that value from the stack back into a register. The goal is to allocate registers in such a way that spills and fills to and from the stack (where the temporary values are stored) are minimized.
  • Loop unrolling can be done to improve register utilization
    • Similar to scalar replacement
for i = 0 to N, i++
  for j = 0 to N, j += 2
    C1 = 0, C2 = 0;
    for k = 0 to N, k += 2
      A1 = A[i][k];
      A2 = A[i][k+1];
      C1 += A1 * B[k][j];
      C2 += A1 * B[k][j+1];
      C1 += A2 * B[k+1][j];
      C2 += A2 * B[k+1][j+1];
    C[i][j] = C1
    C[i][j+1] = C2

If the inner loop is blocked correctly, this optimally utilizes the number of registers in the machine.

Cache Aware Multiplication Algorithm

  • This algorithm uses the blocking optimization.
  • The block size is chosen from machine parameters (e.g., cache and cache-line sizes) so as to obtain optimal performance.
  • Search tools (auto-tuners) can be used to determine the machine parameters for optimal operation of this algorithm.
  • All of the above optimizations can be applied to the cache-aware algorithm.