Algorithms to utilize Locality
Matrix Multiplication
for i = 0 to N
    for j = 0 to M
        C[i][j] = 0
        for k = 0 to P
            C[i][j] += A[i][k] * B[k][j]
Assumptions:
- Matrices are stored in row-major format.
- Matrices are so large that a whole row cannot reside in the cache at one time.
Conclusion:
- Spatial Locality in A as elements are accessed row-wise.
- Temporal Locality in C as all the computations for a particular element of C are done at one time.
- Spatial Locality in C as elements are accessed row-wise.
- No locality in B: it is accessed column-wise (stride equal to the row length), and the matrices are too large for those accesses to hit in the cache.
Best algorithms to exploit locality:
- Stride-1 / "streaming" algorithms
- Stride - the distance, in terms of memory addresses, between consecutive accesses.
To improve the locality of the above algorithm, B could be stored in column-major order.
- This increases the cost of the operation:
- A transpose would have to be done before the multiplication.
Optimization : Loop Interchange
for i = 0 to N
    for k = 0 to P
        for j = 0 to M
            C[i][j] += A[i][k] * B[k][j]

(Each loop keeps its own bound when the loops are interchanged: k still runs to P and j to M. C must be zeroed beforehand.)
Conclusion:
- No temporal locality in C: each C[i][j] is loaded and stored on every iteration of the k loop.
- The temporal locality of C moves to A: A[i][k] is reused across the entire inner loop.
- Spatial Locality in B, which is now accessed row-wise.
- Spatial Locality in A and C.
Optimization: Blocking
for i = 0 to N, i++
    for j = 0 to M, j += B
        for k = 0 to P, k++
            for jj = j to j+B, jj++
                C[i][jj] += A[i][k] * B[k][jj]
The above loop blocks the j dimension into blocks of size B, so B columns of matrix B (and of C) are processed together.
- This restores some temporal locality in C if block size B is small enough.
- The blocking can be done at all levels of caches and for both rows and columns.
- For blocking both rows and columns, 3 blocked loops with indexes ii, jj, and kk are set up. This corresponds to one level of blocking, i.e., blocking for the L1 cache.
- For L2 blocking, a further level with indexes iii, jjj, and kkk is used for both row and column blocking.
Optimization: Scalar Replacement
- Also called Array Promotion
- Uses registers so that multiple LOADs and STOREs can be replaced by a single LOAD and STORE.
- Temporal Locality in matrix C can be utilized by storing the intermediate values of C in registers.
for i = 0 to N, i++
    for j = 0 to M, j++
        C' = 0
        for k = 0 to P, k++
            C' += A[i][k] * B[k][j]
        C[i][j] = C'
- Another kind of optimization is vector register optimization, i.e., using the built-in SIMD vector registers for computation.
- In LOOP-INTERCHANGE optimization, A[i][k] can be placed into a register since it exhibits temporal locality.
- In BLOCKING optimization, loop unrolling can be done:
- Block Size < No of registers
- Interchange and unroll = Unroll and Jam
Register Optimizations
- Registers are closest to the processor in the memory hierarchy.
- As seen in scalar replacement, using registers reduces the number of loads and stores required.
- Better register allocation with fewer spills
- Compilers put values in registers; when they run out of registers for a new value, an old value is stored to the stack so its register can be reused. This is called a "spill". A "fill" is loading that value from the stack back into a register. The idea is to allocate registers in such a way that spills and fills (to and from the stack, where temporary values live) are minimized.
- Loop unrolling can be done to improve register utilization
- Similar to scalar replacement
for i = 0 to N, i++
    for j = 0 to N, j += 2
        C1 = 0; C2 = 0
        for k = 0 to N, k += 2
            A1 = A[i][k]; A2 = A[i][k+1]
            C1 += A1 * B[k][j];   C2 += A1 * B[k][j+1]
            C1 += A2 * B[k+1][j]; C2 += A2 * B[k+1][j+1]
        C[i][j] = C1
        C[i][j+1] = C2
This optimally utilizes the number of registers in a machine, if the inner loop is blocked correctly.
Cache Aware Multiplication Algorithm
- This algorithm uses the blocking optimization.
- The block size is based on machine parameters so as to obtain optimal performance.
- Search tools can be used to calculate the machine parameters for optimal operation of this algorithm.
- All of the above optimizations can be done for Cache Aware Algorithm.