- Use SSE Instructions (see lab 7): DONE
  - load C[jn to jn+n] into a register in the outermost loop (j)
    - store C[jn to jn+n] back into memory (sse)
  - load A[kn to kn+m] into a register in the 2nd loop (k)
    - store A[kn to kn+m] back into memory (sse)
  - leave the innermost loop (i) as is
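Rough sketch of the SSE version of the j-k-i kernel. Assumptions, not taken from the project spec: square n x n matrices stored column-major, n a multiple of 4, and the name matmul_sse is made up for illustration. It computes C += A * B, four floats of a column per instruction.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_storeu_ps, ... */

/* C += A * B for n x n column-major float matrices.
 * Assumption: n is a multiple of 4, so no fringe handling is needed here. */
void matmul_sse(int n, const float *A, const float *B, float *C) {
    assert(n > 0 && n % 4 == 0);
    for (int j = 0; j < n; j++) {
        for (int k = 0; k < n; k++) {
            /* Broadcast the scalar B(k, j) into all four lanes once per k. */
            __m128 b = _mm_set1_ps(B[k + j * n]);
            for (int i = 0; i < n; i += 4) {
                /* Unaligned loads/stores: no alignment assumption on A or C. */
                __m128 a = _mm_loadu_ps(&A[i + k * n]);
                __m128 c = _mm_loadu_ps(&C[i + j * n]);
                c = _mm_add_ps(c, _mm_mul_ps(a, b));
                _mm_storeu_ps(&C[i + j * n], c);
            }
        }
    }
}
```

If the buffers are known to be 16-byte aligned, _mm_load_ps / _mm_store_ps could be used instead of the unaligned variants.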
- Optimize loop ordering (see lab 5): DONE
  - j (outermost)
  - k
  - i (innermost)
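Scalar sketch of the j-k-i ordering, assuming column-major n x n matrices and C += A * B (the layout and the name matmul_jki are assumptions for illustration). With i innermost, both C and A are walked with stride 1, and B(k, j) is constant across the inner loop.

```c
#include <assert.h>

/* C += A * B, column-major, loops ordered j (outer), k, i (inner). */
void matmul_jki(int n, const float *A, const float *B, float *C) {
    assert(n > 0);
    for (int j = 0; j < n; j++) {
        for (int k = 0; k < n; k++) {
            float b = B[k + j * n];          /* reused for every i */
            for (int i = 0; i < n; i++) {
                C[i + j * n] += A[i + k * n] * b;  /* stride-1 in C and A */
            }
        }
    }
}
```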
- Implement Register Blocking (load data into a register once and then use it several times)
  - keep values in registers instead of going to cache every time
  - use Intel SSE instructions and store the data as vectors
  - load C[jn to jn+n] into a register in the outermost loop (j)
    - store C[jn to jn+n] back into memory (sse)
  - load A[kn to kn+m] into a register in the 2nd loop (k)
    - store A[kn to kn+m] back into memory (sse)
  - leave the innermost loop (i) as is
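One possible way to get the "load once, use several times" effect while keeping i innermost: step k by 4, hold four broadcast B values in registers, and reuse each loaded strip of C for four multiply-adds before storing it. This is only a sketch under assumptions (column-major n x n, n a multiple of 4, hypothetical name matmul_regblock); the real blocking factor would have to be tuned.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* C += A * B, column-major, n a multiple of 4 (assumption).
 * Each 4-float strip of C is loaded once and updated 4 times per store. */
void matmul_regblock(int n, const float *A, const float *B, float *C) {
    assert(n > 0 && n % 4 == 0);
    for (int j = 0; j < n; j++) {
        for (int k = 0; k < n; k += 4) {
            /* Four B scalars stay in registers for the whole i loop. */
            __m128 b0 = _mm_set1_ps(B[(k + 0) + j * n]);
            __m128 b1 = _mm_set1_ps(B[(k + 1) + j * n]);
            __m128 b2 = _mm_set1_ps(B[(k + 2) + j * n]);
            __m128 b3 = _mm_set1_ps(B[(k + 3) + j * n]);
            for (int i = 0; i < n; i += 4) {
                __m128 c = _mm_loadu_ps(&C[i + j * n]);   /* one load ... */
                c = _mm_add_ps(c, _mm_mul_ps(_mm_loadu_ps(&A[i + (k + 0) * n]), b0));
                c = _mm_add_ps(c, _mm_mul_ps(_mm_loadu_ps(&A[i + (k + 1) * n]), b1));
                c = _mm_add_ps(c, _mm_mul_ps(_mm_loadu_ps(&A[i + (k + 2) * n]), b2));
                c = _mm_add_ps(c, _mm_mul_ps(_mm_loadu_ps(&A[i + (k + 3) * n]), b3));
                _mm_storeu_ps(&C[i + j * n], c);          /* ... one store */
            }
        }
    }
}
```

Compared with updating C once per k, this cuts the loads and stores of C by a factor of 4.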
- Implement Loop Unrolling (see lab 7) - do first
  - use hadd to unroll the loop further, i.e. more iterations covered by horizontal addition
  - increment every loop by 4*(num of unrolled iterations)
  - unroll iterations of i (innermost loop)
  - fringe case: use the same method as lab07 (sum.c); add an extra check so that the loop variable stays less than height/width: DONE
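Sketch of the unroll-plus-fringe pattern on a plain array sum. This is written from memory of the general technique, not copied from lab07's sum.c, and the name sum_unrolled is made up. The main loop covers 4 vectors (16 floats) per iteration; a scalar tail loop handles whatever is left. The four lanes are added by hand here so that only SSE is needed; _mm_hadd_ps would do the same job but requires SSE3 (pmmintrin.h).

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Sum n floats: unrolled by 4 vectors (step 4 * 4 = 16), then a fringe loop. */
float sum_unrolled(const float *x, int n) {
    assert(n >= 0);
    __m128 acc = _mm_setzero_ps();
    int i = 0;
    /* Extra check i + 16 <= n keeps every load inside the array. */
    for (; i + 16 <= n; i += 16) {
        acc = _mm_add_ps(acc, _mm_loadu_ps(x + i));
        acc = _mm_add_ps(acc, _mm_loadu_ps(x + i + 4));
        acc = _mm_add_ps(acc, _mm_loadu_ps(x + i + 8));
        acc = _mm_add_ps(acc, _mm_loadu_ps(x + i + 12));
    }
    /* Horizontal reduction of the four lanes. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    /* Fringe case: fewer than 16 elements remain. */
    for (; i < n; i++) {
        total += x[i];
    }
    return total;
}
```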
- Cache Blocking - next
  - find the optimal block size: run a script that increases/tests different block sizes
  - 64 byte block = 512 bit block = 4 vectors/block = 16 floats/block
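Scalar sketch of cache blocking with the block size as a parameter, so a script can sweep it. Assumptions: column-major n x n matrices, C += A * B, and the names matmul_blocked and bs are hypothetical. The min_int calls handle the fringe when n is not a multiple of bs.

```c
#include <assert.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/* C += A * B, column-major, processed in bs x bs tiles. */
void matmul_blocked(int n, int bs, const float *A, const float *B, float *C) {
    assert(n > 0 && bs > 0);
    for (int jj = 0; jj < n; jj += bs) {
        for (int kk = 0; kk < n; kk += bs) {
            for (int ii = 0; ii < n; ii += bs) {
                int j_end = min_int(jj + bs, n);
                int k_end = min_int(kk + bs, n);
                int i_end = min_int(ii + bs, n);
                /* Same j-k-i order as the unblocked kernel, inside one tile. */
                for (int j = jj; j < j_end; j++) {
                    for (int k = kk; k < k_end; k++) {
                        float b = B[k + j * n];
                        for (int i = ii; i < i_end; i++) {
                            C[i + j * n] += A[i + k * n] * b;
                        }
                    }
                }
            }
        }
    }
}
```

A sweep would call this with bs = 4, 8, 16, 32, ... on a fixed n and time each run; which value wins depends on the machine's cache sizes.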
- Compiler Tricks (minor modifications to your source code can cause the compiler to produce a faster program)