Algorithms to utilize Locality
Matrix Multiplication
for i = 0 to N
    for j = 0 to M
        C[i][j] = 0
        for k = 0 to P
            C[i][j] += A[i][k] * B[k][j]
Assumptions:
- Matrices are stored in row-major format.
- Matrices are so large that a whole row cannot reside in the cache at one time.
Conclusion:
- Spatial Locality in A as elements are accessed row-wise.
- Temporal Locality in C as all the computations for a particular element of C are done at one time.
- Spatial Locality in C as elements are accessed row-wise.
- No locality in B: it is accessed column-wise (stride equal to the row length), and the matrices are too large for those accesses to hit in the cache.
Best algorithms to exploit locality:
- Stride-1 / "streaming" algorithms
- Stride - the distance, in terms of memory addresses, between consecutive accesses.
To improve the locality of the above algorithm, B could be stored in column-major order.
- This increases the cost of the operation:
- A transpose would have to be done before the multiplication.
Optimization : Loop Interchange
for i = 0 to N
    for k = 0 to P
        for j = 0 to M
            C[i][j] += A[i][k] * B[k][j]

(Each loop keeps its own bound when the loops are interchanged: k still runs to P and j to M. C must be zeroed beforehand.)
Conclusion:
- No temporal locality in C: each C[i][j] is loaded and stored on every iteration of the k loop.
- The temporal locality of C moves to A: A[i][k] is reused across the entire inner loop.
- Spatial Locality in B, which is now accessed row-wise.
- Spatial Locality in A and C.
Optimization: Blocking
for i = 0 to N, i++
    for j = 0 to M, j += B
        for k = 0 to P, k++
            for jj = j to j+B, jj++
                C[i][jj] += A[i][k] * B[k][jj]
The above loop blocks the j dimension into blocks of size B, so B columns of matrix B (and of C) are processed together.
- This restores some temporal locality in C if block size B is small enough.
- The blocking can be done at all levels of caches and for both rows and columns.
- For blocking both rows and columns, 3 blocked loops with indexes ii, jj, and kk are set up. This corresponds to one level of blocking, i.e., blocking for the L1 cache.
- For L2 blocking, a further level with indexes iii, jjj, and kkk is used for both row and column blocking.
Optimization: Scalar Replacement
- Also called Array Promotion
- Uses registers so that multiple LOADs and STOREs can be replaced by a single LOAD and STORE.
- Temporal Locality in matrix C can be utilized by storing the intermediate values of C in registers.
for i = 0 to N, i++
    for j = 0 to M, j++
        C' = 0
        for k = 0 to P, k++
            C' += A[i][k] * B[k][j]
        C[i][j] = C'
- Another kind of optimization is vector register optimization, i.e., using the built-in SIMD vector registers for computation.
- In LOOP-INTERCHANGE optimization, A[i][k] can be placed into a register since it exhibits temporal locality.
- In BLOCKING optimization, loop unrolling can be done:
- Block Size < No of registers
- Interchange and unroll = Unroll and Jam
Register Optimizations
- Registers are closest to the processor in the memory hierarchy.
- As seen in scalar replacement, using registers reduces the number of loads and stores required.
- Better register allocation with fewer spills
- Compilers put values in registers; when they run out of registers for a new value, an old value is stored to the stack so its register can be reused. This is called a "spill". A "fill" is loading that value from the stack back into a register. The idea is to allocate registers in such a way that spills and fills (to and from the stack, where temporary values live) are minimized.
- Loop unrolling can be done to improve register utilization
- Similar to scalar replacement
for i = 0 to N, i++
    for j = 0 to N, j += 2
        C1 = 0; C2 = 0
        for k = 0 to N, k += 2
            A1 = A[i][k]; A2 = A[i][k+1]
            C1 += A1 * B[k][j];   C2 += A1 * B[k][j+1]
            C1 += A2 * B[k+1][j]; C2 += A2 * B[k+1][j+1]
        C[i][j] = C1
        C[i][j+1] = C2
This optimally utilizes the number of registers in a machine, if the inner loop is blocked correctly.
Cache Aware Multiplication Algorithm
- This algorithm uses the blocking optimization.
- The block size is based on machine parameters so as to obtain optimal performance.
- Search tools can be used to calculate the machine parameters for optimal operation of this algorithm.
- All of the above optimizations can be done for Cache Aware Algorithm.