MatrixDecryption

Use SSE Instructions (see lab 7): DONE

- Load C[jn to jn+n] into registers in the outermost loop (j), then store C[jn to jn+n] back into memory (SSE).
- Load A[kn to kn+m] into registers in the second loop (k), then store A[kn to kn+m] back into memory (SSE).
- Leave the innermost loop (i) as is.
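The plan above can be sketched in C with SSE intrinsics from `<xmmintrin.h>`. Everything beyond the notes is an assumption: C += A * B with n x n matrices stored column-major (element (i, j) at `X[i + j*n]`), n a multiple of 4 and at most 64, and the hypothetical names `sgemm_col_in_regs` and `MAX_VECS`. Column j of C is loaded into an array of `__m128` values once per j and stored back once. Whether that array actually stays in registers is up to the compiler. This sketch only reads A, so it never stores A back.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_mul_ps, _mm_add_ps, ... */

#define MAX_VECS 16     /* assumption: n <= 4 * MAX_VECS = 64 */

/* C += A * B, n x n column-major; assumes n % 4 == 0 and n <= 64. */
void sgemm_col_in_regs(int n, const float *A, const float *B, float *C) {
    __m128 ccol[MAX_VECS];
    int nv = n / 4;                                  /* SSE vectors per column */
    for (int j = 0; j < n; j++) {                    /* outermost loop: j */
        for (int v = 0; v < nv; v++)                 /* load column j of C once */
            ccol[v] = _mm_loadu_ps(C + j * n + 4 * v);
        for (int k = 0; k < n; k++) {                /* second loop: k */
            __m128 b = _mm_set1_ps(B[k + j * n]);    /* broadcast B(k, j) */
            const float *acol = A + k * n;           /* column k of A */
            for (int v = 0; v < nv; v++)             /* innermost loop: i, 4 rows at a time */
                ccol[v] = _mm_add_ps(ccol[v],
                                     _mm_mul_ps(_mm_loadu_ps(acol + 4 * v), b));
        }
        for (int v = 0; v < nv; v++)                 /* store column j of C back once */
            _mm_storeu_ps(C + j * n + 4 * v, ccol[v]);
    }
}
```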

Optimize loop ordering (see lab 5): DONE, order j (outer), k, i (inner)
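A scalar sketch of the j, k, i order, assuming C += A * B on n x n column-major matrices (element (i, j) at `X[i + j*n]`). `sgemm_jki` is a hypothetical name. With i innermost, the inner loop walks down column k of A and column j of C with unit stride, while B(k, j) stays fixed in a local variable.

```c
/* C += A * B for n x n column-major matrices (assumed layout: X[i + j*n]). */
void sgemm_jki(int n, const float *A, const float *B, float *C) {
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++) {
            float b = B[k + j * n];              /* fixed for the whole i loop */
            for (int i = 0; i < n; i++)          /* unit stride through A and C */
                C[i + j * n] += A[i + k * n] * b;
        }
}
```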
Implement Register Blocking (load data into a register once and then use it several times): keep values in registers instead of going to cache every time, using Intel intrinsics to hold the data as vectors.
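One way to get reuse out of registers, sketched here as an assumption rather than the notes' exact plan, is to compute two columns of C per pass so each vector of A loaded in the inner loop is used twice. The sketch assumes C += A * B on n x n column-major matrices with n a multiple of 4. `sgemm_regblock_2col` is a hypothetical name. Widening the block to more columns trades more reuse for more live registers (SSE has 8 xmm registers in 32-bit mode and 16 in 64-bit mode).

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* C += A * B, n x n column-major; assumes n % 4 == 0 (so n is also even). */
void sgemm_regblock_2col(int n, const float *A, const float *B, float *C) {
    for (int j = 0; j < n; j += 2) {                      /* two columns of C per pass */
        for (int k = 0; k < n; k++) {
            __m128 b0 = _mm_set1_ps(B[k + j * n]);        /* B(k, j)   */
            __m128 b1 = _mm_set1_ps(B[k + (j + 1) * n]);  /* B(k, j+1) */
            for (int i = 0; i < n; i += 4) {
                __m128 a = _mm_loadu_ps(A + i + k * n);   /* loaded once... */
                float *c0 = C + i + j * n;
                float *c1 = C + i + (j + 1) * n;
                /* ...and used twice, once per column of C */
                _mm_storeu_ps(c0, _mm_add_ps(_mm_loadu_ps(c0), _mm_mul_ps(a, b0)));
                _mm_storeu_ps(c1, _mm_add_ps(_mm_loadu_ps(c1), _mm_mul_ps(a, b1)));
            }
        }
    }
}
```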


Implement Loop Unrolling (see lab 7) - do first

Use hadd to unroll the loop further, i.e., cover more iterations with horizontal addition.
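A sketch of how hadd can reduce several accumulators at once. It assumes SSE3 (`_mm_hadd_ps` from `<pmmintrin.h>`) and uses the GCC/Clang `target("sse3")` attribute so no extra compiler flag is needed. `_mm_hadd_ps(a, b)` returns `{a0+a1, a2+a3, b0+b1, b2+b3}`, so two levels of it turn four vectors into one vector of their four sums. The names `hsum4x4` and `dot4`, and the use of four dot products as the workload, are hypothetical.

```c
#include <pmmintrin.h>  /* SSE3: _mm_hadd_ps (also pulls in the SSE/SSE2 headers) */

/* Returns {sum(v0), sum(v1), sum(v2), sum(v3)} via two levels of hadd. */
__attribute__((target("sse3")))
static __m128 hsum4x4(__m128 v0, __m128 v1, __m128 v2, __m128 v3) {
    __m128 t01 = _mm_hadd_ps(v0, v1);   /* {v0 pair sums, v1 pair sums} */
    __m128 t23 = _mm_hadd_ps(v2, v3);   /* {v2 pair sums, v3 pair sums} */
    return _mm_hadd_ps(t01, t23);
}

/* out[r] = dot(x, y + r * len) for r = 0..3; assumes len % 4 == 0. */
__attribute__((target("sse3")))
void dot4(const float *x, const float *y, int len, float out[4]) {
    __m128 acc0 = _mm_setzero_ps(), acc1 = _mm_setzero_ps();
    __m128 acc2 = _mm_setzero_ps(), acc3 = _mm_setzero_ps();
    for (int t = 0; t < len; t += 4) {
        __m128 xv = _mm_loadu_ps(x + t);   /* loaded once, used four times */
        acc0 = _mm_add_ps(acc0, _mm_mul_ps(xv, _mm_loadu_ps(y + 0 * len + t)));
        acc1 = _mm_add_ps(acc1, _mm_mul_ps(xv, _mm_loadu_ps(y + 1 * len + t)));
        acc2 = _mm_add_ps(acc2, _mm_mul_ps(xv, _mm_loadu_ps(y + 2 * len + t)));
        acc3 = _mm_add_ps(acc3, _mm_mul_ps(xv, _mm_loadu_ps(y + 3 * len + t)));
    }
    /* one hadd tree reduces all four accumulators at once */
    _mm_storeu_ps(out, hsum4x4(acc0, acc1, acc2, acc3));
}
```

Compiling with `-msse3` and dropping the attributes should work equally well.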

Unroll iterations of i (the innermost loop), incrementing i on each pass by 4 * (number of unrolled iterations).
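A sketch of unrolling the i loop by four SSE vectors, so i advances by 4 * 4 = 16 floats per pass. It assumes C += A * B on n x n column-major matrices with n a multiple of 16 (any remainder would need fringe handling). `sgemm_unroll4` is a hypothetical name.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* C += A * B, n x n column-major (X[i + j*n]); assumes n % 16 == 0. */
void sgemm_unroll4(int n, const float *A, const float *B, float *C) {
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++) {
            __m128 b = _mm_set1_ps(B[k + j * n]);   /* B(k, j) in all 4 lanes */
            const float *a = A + k * n;             /* column k of A */
            float *c = C + j * n;                   /* column j of C */
            for (int i = 0; i < n; i += 4 * 4) {    /* 4 floats/vector x 4 unrolled vectors */
                __m128 c0 = _mm_loadu_ps(c + i);
                __m128 c1 = _mm_loadu_ps(c + i + 4);
                __m128 c2 = _mm_loadu_ps(c + i + 8);
                __m128 c3 = _mm_loadu_ps(c + i + 12);
                c0 = _mm_add_ps(c0, _mm_mul_ps(_mm_loadu_ps(a + i), b));
                c1 = _mm_add_ps(c1, _mm_mul_ps(_mm_loadu_ps(a + i + 4), b));
                c2 = _mm_add_ps(c2, _mm_mul_ps(_mm_loadu_ps(a + i + 8), b));
                c3 = _mm_add_ps(c3, _mm_mul_ps(_mm_loadu_ps(a + i + 12), b));
                _mm_storeu_ps(c + i, c0);
                _mm_storeu_ps(c + i + 4, c1);
                _mm_storeu_ps(c + i + 8, c2);
                _mm_storeu_ps(c + i + 12, c3);
            }
        }
}
```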

Fringe case: use the same method as lab07 (sum.c), with an extra check so that the index variable stays less than the height/width: DONE
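A sketch of fringe handling in the spirit of the lab07 `sum.c` approach: a vector loop over the largest multiple of the vector width, then a scalar loop for the rest. This is not the lab's actual code. It assumes C += A * B on n x n column-major matrices, with any n >= 1. `sgemm_fringe` is a hypothetical name. Both loops check against n, so no load or store goes past the end of a column.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* C += A * B, n x n column-major, any n >= 1: SSE body plus scalar fringe. */
void sgemm_fringe(int n, const float *A, const float *B, float *C) {
    int n4 = n / 4 * 4;                      /* largest multiple of 4 <= n */
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++) {
            float bs = B[k + j * n];
            __m128 b = _mm_set1_ps(bs);
            const float *a = A + k * n;
            float *c = C + j * n;
            int i = 0;
            for (; i < n4; i += 4)           /* vector body: i + 3 < n always */
                _mm_storeu_ps(c + i, _mm_add_ps(_mm_loadu_ps(c + i),
                                                _mm_mul_ps(_mm_loadu_ps(a + i), b)));
            for (; i < n; i++)               /* fringe: the last n % 4 rows */
                c[i] += a[i] * bs;
        }
}
```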

Cache Blocking (next): find the optimal block size (and so the number of blocks) by running a script that tests increasing block sizes. 64-byte block = 512-bit block = 4 vectors/block = 16 floats/block
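The block-size arithmetic: 64 bytes = 512 bits. With 128-bit SSE vectors that is 512 / 128 = 4 vectors, and with 4-byte floats 64 / 4 = 16 floats per cache block. Below is a scalar sketch of cache blocking with the tile size as a runtime parameter, so a tuning script could sweep values such as 16, 32, 64, and so on, timing each one. It assumes C += A * B on n x n column-major matrices. `sgemm_blocked`, `block`, and `min_int` are hypothetical names. The `min_int` clamps handle tiles that run past the matrix edge.

```c
/* Clamp helper for tiles that extend past the edge of the matrix. */
static int min_int(int a, int b) { return a < b ? a : b; }

/* C += A * B for n x n column-major matrices, tiled with a tunable block size. */
void sgemm_blocked(int n, int block, const float *A, const float *B, float *C) {
    for (int jj = 0; jj < n; jj += block)
        for (int kk = 0; kk < n; kk += block)
            for (int ii = 0; ii < n; ii += block) {
                int jmax = min_int(jj + block, n);   /* fringe tiles are smaller */
                int kmax = min_int(kk + block, n);
                int imax = min_int(ii + block, n);
                for (int j = jj; j < jmax; j++)      /* j, k, i order inside each tile */
                    for (int k = kk; k < kmax; k++) {
                        float b = B[k + j * n];
                        for (int i = ii; i < imax; i++)
                            C[i + j * n] += A[i + k * n] * b;
                    }
            }
}
```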
Compiler Tricks (minor modifications to your source code can cause the compiler to produce a faster program)
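Two common source-level tweaks, offered as assumptions rather than what the course recommends. First, `restrict`-qualified pointers (C99) tell the compiler the matrices do not overlap, so it need not reload A or B after every store to C. Second, hoisting column pointers out of the inner loops removes repeated index arithmetic. The sketch assumes C += A * B on n x n column-major matrices. `sgemm_hints` is a hypothetical name. Whether either change helps depends on the compiler and optimization level, so timing both versions is worthwhile.

```c
/* C += A * B, n x n column-major. restrict promises A, B, C never alias. */
void sgemm_hints(int n, const float *restrict A, const float *restrict B,
                 float *restrict C) {
    for (int j = 0; j < n; j++) {
        float *cj = C + j * n;                 /* column j of C, hoisted */
        const float *bj = B + j * n;           /* column j of B, hoisted */
        for (int k = 0; k < n; k++) {
            const float *ak = A + k * n;       /* column k of A, hoisted */
            float b = bj[k];
            for (int i = 0; i < n; i++)
                cj[i] += ak[i] * b;
        }
    }
}
```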
