MatrixDecryption

  1. Use SSE Instructions (see lab 7): DONE

- load `C[j*n to j*n+n]` into a register in the outermost loop (j)
- store `C[j*n to j*n+n]` back into memory (SSE)
- load `A[k*n to k*n+m]` into a register in the 2nd loop (k)
- store `A[k*n to k*n+m]` back into memory (SSE)
- leave the innermost loop (i) as is

  2. Optimize loop ordering (see lab 5): DONE. Order used: j (outermost), k, i (innermost)
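To make the j-k-i ordering concrete, here is a minimal scalar sketch. It assumes square `n` x `n` column-major matrices and a generic `C += A * B` product; the function name `matmul_jki` and the second operand `B` are hypothetical and not taken from the project source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: C += A * B for n x n column-major matrices,
 * where element (row, col) lives at index row + col * n. */
void matmul_jki(size_t n, const float *A, const float *B, float *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t j = 0; j < n; j++) {             /* column of C and B */
        for (size_t k = 0; k < n; k++) {         /* column of A, row of B */
            float b = B[k + j * n];              /* constant across the i loop */
            for (size_t i = 0; i < n; i++) {     /* stride-1 walk through A and C */
                C[i + j * n] += A[i + k * n] * b;
            }
        }
    }
}
```

With i innermost, both `A[i + k*n]` and `C[i + j*n]` are read at stride 1, which is what makes this order cache friendly for column-major data.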

  3. Implement Register Blocking (load data into a register once and then use it several times)

- store values in a register instead of going to cache every time
- use Intel SSE instructions and store the data as vectors

- load `C[j*n to j*n+n]` into a register in the outermost loop (j)
- store `C[j*n to j*n+n]` back into memory (SSE)
- load `A[k*n to k*n+m]` into a register in the 2nd loop (k)
- store `A[k*n to k*n+m]` back into memory (SSE)
- leave the innermost loop (i) as is
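The load-once, reuse-many idea can be sketched with SSE intrinsics as follows. This is an illustration under assumptions, not the project's actual kernel: it assumes a generic column-major `C += A * B`, and the names `matmul_sse` and `B` are hypothetical. The value reused from a register here is the broadcast element of `B`, which stays in `bv` for the whole i loop.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_storeu_ps, ... */

/* Hypothetical helper: C += A * B, n x n, column-major, 4 floats per step. */
void matmul_sse(size_t n, const float *A, const float *B, float *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t j = 0; j < n; j++) {
        for (size_t k = 0; k < n; k++) {
            float b = B[k + j * n];
            __m128 bv = _mm_set1_ps(b);          /* broadcast once, reused below */
            size_t i = 0;
            for (; i + 4 <= n; i += 4) {
                __m128 c = _mm_loadu_ps(C + i + j * n);   /* 4 floats of C */
                __m128 a = _mm_loadu_ps(A + i + k * n);   /* 4 floats of A */
                c = _mm_add_ps(c, _mm_mul_ps(a, bv));
                _mm_storeu_ps(C + i + j * n, c);          /* write back */
            }
            for (; i < n; i++) {                 /* fringe: n not a multiple of 4 */
                C[i + j * n] += A[i + k * n] * b;
            }
        }
    }
}
```

Unaligned loads and stores (`_mm_loadu_ps`, `_mm_storeu_ps`) are used so the sketch works for any `n` and any buffer alignment.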

  4. Implement Loop Unrolling (see lab 7) - do first

- use hadd to unroll the loop further, i.e. more iterations covered by horizontal addition
- increment the loop counter by 4*(number of unrolled iterations) on every pass
- unroll iterations of i (innermost loop)
- fringe case: use the same method as lab07 (sum.c), adding an extra check so that the loop variable stays less than height/width: DONE
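The unrolling and fringe handling can be illustrated on a simple vector sum, in the spirit of the lab exercise. This is a sketch, not the lab's `sum.c`: the name `sum_unrolled` and the unroll factor of 4 are assumptions. The main loop advances by 4*4 = 16 floats, and the final reduction adds the four lanes manually instead of using hadd so the code only needs plain SSE.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: sum n floats with a 4x unrolled SSE loop. */
float sum_unrolled(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    __m128 acc0 = _mm_setzero_ps();
    __m128 acc1 = _mm_setzero_ps();
    __m128 acc2 = _mm_setzero_ps();
    __m128 acc3 = _mm_setzero_ps();
    size_t i = 0;

    /* Unrolled body: 4 vectors * 4 floats = 16 floats per iteration. */
    for (; i + 16 <= n; i += 16) {
        acc0 = _mm_add_ps(acc0, _mm_loadu_ps(x + i));
        acc1 = _mm_add_ps(acc1, _mm_loadu_ps(x + i + 4));
        acc2 = _mm_add_ps(acc2, _mm_loadu_ps(x + i + 8));
        acc3 = _mm_add_ps(acc3, _mm_loadu_ps(x + i + 12));
    }
    /* First fringe: whole vectors that did not fill an unrolled iteration. */
    for (; i + 4 <= n; i += 4) {
        acc0 = _mm_add_ps(acc0, _mm_loadu_ps(x + i));
    }

    /* Combine the accumulators, then add the 4 lanes together. */
    acc0 = _mm_add_ps(_mm_add_ps(acc0, acc1), _mm_add_ps(acc2, acc3));
    float lanes[4];
    _mm_storeu_ps(lanes, acc0);
    float total = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    /* Second fringe: up to 3 leftover scalars; the i < n check is the bound guard. */
    for (; i < n; i++) {
        total += x[i];
    }
    return total;
}
```

Using separate accumulators keeps the four additions in each unrolled iteration independent of one another.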

  5. Cache Blocking - next

- find the optimal block size: run a script that increases the block size and tests each value
- 64-byte block = 512-bit block = 4 vectors/block = 16 floats/block
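A rough sketch of what cache blocking could look like for the same j-k-i kernel is shown below. It assumes a generic column-major `C += A * B`; the names `matmul_blocked`, `min_sz`, `B`, and the parameter `bs` (block size in elements) are hypothetical. Passing `bs` as a parameter is what lets a test script try several block sizes.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Hypothetical helper: C += A * B, n x n, column-major, in bs x bs blocks. */
void matmul_blocked(size_t n, size_t bs,
                    const float *A, const float *B, float *C)
{
    assert(bs > 0);
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t jj = 0; jj < n; jj += bs) {
        for (size_t kk = 0; kk < n; kk += bs) {
            for (size_t ii = 0; ii < n; ii += bs) {
                /* Clamp block edges so partial blocks at the border work. */
                size_t jmax = min_sz(jj + bs, n);
                size_t kmax = min_sz(kk + bs, n);
                size_t imax = min_sz(ii + bs, n);
                for (size_t j = jj; j < jmax; j++) {
                    for (size_t k = kk; k < kmax; k++) {
                        float b = B[k + j * n];
                        for (size_t i = ii; i < imax; i++) {
                            C[i + j * n] += A[i + k * n] * b;
                        }
                    }
                }
            }
        }
    }
}
```

A block-size search could then time this function for a range of `bs` values (for example with `clock()` from `<time.h>`) and keep the fastest one; the best value depends on the machine's cache sizes.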

  6. Compiler Tricks (minor modifications to your source code can cause the compiler to produce a faster program)