Mattan Erez Courses : Fa09EE382N - Lect 6 Notes browse

Cache Oblivious Algorithm
Parallelism in Hardware

Cache Oblivious Algorithm

Slides by Volker Strumpen
To run an algorithm on a system, when the machine parameters like cache size are not known,Cache Oblivious Algorithm is used.
It uses a blocking that is asymptotically optimal for all cases.

This method tries to recursively decompose the input matrices into smaller matrices, by a procedure called leaf-unfolding.

void recur(int i0, int i1,int j0, int j1, int k0, int k1)

    { 

        int i, j, k, di = i1 - i0, dj = j1 - j0, dk = k1 - k0;

        const int CUTOFF = 8;

        if (di >= dj && di >= dk && di > CUTOFF) {

            int im = (i0 + i1) / 2;

            recur(i0,im,j0,j1,k0,k1);                         

            recur(im,i1,j0,j1,k0,k1);

        } else if (dj >= dk && dj > CUTOFF) {

            int jm = (j0 + j1) / 2;

            recur(i0,i1,j0,jm,k0,k1);                   

            recur(i0,i1,jm,j1,k0,k1);

        } else if (dk > CUTOFF) {

            int km = (k0 + k1) / 2;

            recur(i0,i1,j0,j1,k0,km);                

            recur(i0,i1,j0,j1,km,k1);

        } else {

            for (i = i0; i < i1; ++i) 

                for (j = j0; j < j1; ++j)

                    for (k = k0; k < k1; ++k)

                        C[i][j] += A[i][k] * B[k][j];

        }

    }

[$[Get Code]]

Explanation

To understand the afore-mentioned program, we could take a simple example. Consider 2 input matrices A and B of dimensions i x k and k x i respectively. The resultant matrix C = A x B is therefore of dimensions i x j. Let us assume i=j=k=4 and CUTOFF=2. To start with i0=0, i1=4, j0=0, j1=4, k0=0, k1=4. The dimensions of A, B and C are i x k, k x j and i x j respectively.

The various levels of the function call tree are listed below.

Level 1

When recur() is invoked with the given arguments, this statement executes first, making two calls to recur() -

if (di >= dj && di >= dk && di > CUTOFF) {

            int im = (i0 + i1) / 2;

            recur(i0,im,j0,j1,k0,k1);                         

            recur(im,i1,j0,j1,k0,k1);

    }

[$[Get Code]]

Level2

Each of the two calls to recur() executes this portion of code -

else if (dj >= dk && dj > CUTOFF) {

            int jm = (j0 + j1) / 2;

            recur(i0,i1,j0,jm,k0,k1);                   

            recur(i0,i1,jm,j1,k0,k1);

    }

[$[Get Code]]

Level 3

Each of the 4 calls to recur() executes this portion of code -

else if (dk > CUTOFF) {

            int km = (k0 + k1) / 2;

            recur(i0,i1,j0,j1,k0,km);                

            recur(i0,i1,j0,j1,km,k1);

    }

[$[Get Code]]

Level 4

Now each of the 8 calls to recur() executes the base code, as we had defined a CUTOFF of 2 -

else {

          for (i = i0; i < i1; ++i) 

              for (j = j0; j < j1; ++j)

                  for (k = k0; k < k1; ++k)

                      C[i][j] += A[i][k] * B[k][j];

    }

[$[Get Code]]

At this level, matrix multiplication is performed for matrices of dimensions 2 x 2. The actual computation re-uses elements from the matrix periodically, and since the size of the matrices is small, there is a lesser chance of matrix elements being evicted from cache.

The recursive approach scales very well for larger matrix dimensions. The following comparison confirms that the recursive approach performs better than the iterative approach.

The recursive approach for matrix transposition and multiplication is cache-oblivious in that

nothing is known about the cache hierarchy, line size, etc.
it attempts to improve locality by means of smaller working sets for computation.
all cache levels are used asymptotically optimally.

Parallelism in Hardware

Mattan’s slides Δ