$A (M \times K)$, $B (K \times N)$, $C (M \times N)$,
$C = \alpha \cdot AB + \beta \cdot C$
block$(32, 32)$ has 1024 threads, grid$((M-1)/32+1, (N-1)/32+1)$, i.e. one thread per element of $C$.
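A minimal sketch of the naive kernel this configuration implies (the kernel name, row-major layout, and device pointer names are assumptions, not from the original):

```cuda
#include <cuda_runtime.h>

// Naive SGEMM: one thread per element of C, row-major A, B, C.
__global__ void sgemm_naive(int M, int N, int K, float alpha,
                            const float *A, const float *B,
                            float beta, float *C) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;   // index along M
    int col = blockIdx.y * blockDim.y + threadIdx.y;   // index along N
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
    }
}

// Launch with the configuration above:
//   dim3 block(32, 32);                               // 1024 threads
//   dim3 grid((M - 1) / 32 + 1, (N - 1) / 32 + 1);
//   sgemm_naive<<<grid, block>>>(M, N, K, alpha, dA, dB, beta, dC);
```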
Pre-offset the pointers (A, B, C) to the block's position and cache threadIdx in registers, so the loop body has no branches and no repeated index arithmetic.
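One common way this is applied, sketched under the assumption that M and N are multiples of 32 so the bounds branch can be dropped entirely (the kernel name is illustrative):

```cuda
// Shift A, B, C to this block's corner once and cache threadIdx in registers;
// the loop body is then a pure multiply-accumulate with no branches.
__global__ void sgemm_offset(int M, int N, int K, float alpha,
                             const float *A, const float *B,
                             float beta, float *C) {
    const int tx = threadIdx.x;                  // cached once in registers
    const int ty = threadIdx.y;
    A += blockIdx.x * 32 * K;                    // first row of this block's A
    B += blockIdx.y * 32;                        // first column of this block's B
    C += blockIdx.x * 32 * N + blockIdx.y * 32;  // this block's C tile

    float acc = 0.0f;
    for (int k = 0; k < K; ++k)
        acc += A[tx * K + k] * B[k * N + ty];    // no branches, no index recompute
    C[tx * N + ty] = alpha * acc + beta * C[tx * N + ty];
}
```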
Use shared memory for local tiles A (32x32) and B (32x32) to reduce the cost of visiting global memory: a global memory access costs hundreds of clock cycles, while on-chip shared memory costs only tens.
Add padding to the shared memory tiles to deal with bank conflicts. 1 thread still computes 1 element.
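A sketch of the 32x32 shared-memory tiling with padding, assuming M, N, K are multiples of 32 (the kernel name and TILE constant are illustrative). The +1 column of padding keeps the 32 threads of a warp in different banks when they read a column of As:

```cuda
__global__ void sgemm_smem(int M, int N, int K, float alpha,
                           const float *A, const float *B,
                           float beta, float *C) {
    constexpr int TILE = 32;
    __shared__ float As[TILE][TILE + 1];        // padded to avoid bank conflicts
    __shared__ float Bs[TILE][TILE + 1];

    const int tx = threadIdx.x, ty = threadIdx.y;
    const int row = blockIdx.x * TILE + tx;     // along M
    const int col = blockIdx.y * TILE + ty;     // along N

    float acc = 0.0f;
    for (int k0 = 0; k0 < K; k0 += TILE) {
        As[tx][ty] = A[row * K + k0 + ty];      // one global load per thread
        Bs[tx][ty] = B[(k0 + tx) * N + col];
        __syncthreads();                        // tile fully staged
        for (int k = 0; k < TILE; ++k)
            acc += As[tx][k] * Bs[k][ty];       // reuse from shared memory
        __syncthreads();                        // before overwriting the tile
    }
    C[row * N + col] = alpha * acc + beta * C[row * N + col];
}
// Launch: dim3 block(32, 32); dim3 grid(M / 32, N / 32);
```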
The GPU supports 128-bit vectorized memory access: 128 bit = float4 = double2. Reading 4 consecutive values at once with float4 cuts the number of load instructions, and with each thread computing 4 elements the arithmetic intensity goes up. 1 thread computes 4 elements.
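A sketch of the float4 step, assuming N is a multiple of 4 so the 128-bit accesses stay 16-byte aligned (the kernel and helper names are illustrative):

```cuda
__device__ inline float4 load_f4(const float *p) {
    return *reinterpret_cast<const float4 *>(p);
}

__global__ void sgemm_vec4(int M, int N, int K, float alpha,
                           const float *A, const float *B,
                           float beta, float *C) {
    int row  = blockIdx.x * blockDim.x + threadIdx.x;         // one row of C
    int col0 = (blockIdx.y * blockDim.y + threadIdx.y) * 4;   // 4 columns of C
    if (row >= M || col0 >= N) return;

    float4 acc = make_float4(0.f, 0.f, 0.f, 0.f);
    for (int k = 0; k < K; ++k) {
        float  a = A[row * K + k];               // scalar A value, reused 4 times
        float4 b = load_f4(&B[k * N + col0]);    // 4 B values in one 128-bit load
        acc.x += a * b.x;  acc.y += a * b.y;
        acc.z += a * b.z;  acc.w += a * b.w;
    }
    float4 c = load_f4(&C[row * N + col0]);      // 128-bit load of C
    c.x = alpha * acc.x + beta * c.x;  c.y = alpha * acc.y + beta * c.y;
    c.z = alpha * acc.z + beta * c.z;  c.w = alpha * acc.w + beta * c.w;
    *reinterpret_cast<float4 *>(&C[row * N + col0]) = c;       // 128-bit store
}
// Launch e.g.: dim3 block(32, 8); dim3 grid((M - 1) / 32 + 1, (N - 1) / 32 + 1);
// each block still covers a 32x32 patch of C, now with 256 threads.
```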
1 block computes a local C tile (64x64) and loads local tiles A (64x16) and B (16x64) per step; 1 thread computes a 4x4 local tile.
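A sketch of this register-tiled step; the tile constants BM/BN/BK/TM/TN and the kernel name are illustrative, and M, N are assumed multiples of 64 with K a multiple of 16:

```cuda
__global__ void sgemm_64x64(int M, int N, int K, float alpha,
                            const float *A, const float *B,
                            float beta, float *C) {
    constexpr int BM = 64, BN = 64, BK = 16, TM = 4, TN = 4;
    __shared__ float As[BM][BK];                 // 64x16 tile of A
    __shared__ float Bs[BK][BN];                 // 16x64 tile of B

    const int tx  = threadIdx.x;                 // 0..15, row group of the C tile
    const int ty  = threadIdx.y;                 // 0..15, column group
    const int tid = ty * 16 + tx;                // linear id for cooperative loads

    A += blockIdx.x * BM * K;                    // this block's rows of A
    B += blockIdx.y * BN;                        // this block's columns of B
    C += blockIdx.x * BM * N + blockIdx.y * BN;  // this block's 64x64 C tile

    float acc[TM][TN] = {};                      // 4x4 accumulator in registers

    for (int k0 = 0; k0 < K; k0 += BK) {
        // 256 threads stage As (64x16) and Bs (16x64): 4 elements each per tile.
        for (int i = tid; i < BM * BK; i += 256)
            As[i / BK][i % BK] = A[(i / BK) * K + k0 + (i % BK)];
        for (int i = tid; i < BK * BN; i += 256)
            Bs[i / BN][i % BN] = B[(k0 + i / BN) * N + (i % BN)];
        __syncthreads();

        for (int k = 0; k < BK; ++k)
            for (int m = 0; m < TM; ++m)
                for (int n = 0; n < TN; ++n)
                    acc[m][n] += As[tx * TM + m][k] * Bs[k][ty * TN + n];
        __syncthreads();
    }

    for (int m = 0; m < TM; ++m)
        for (int n = 0; n < TN; ++n) {
            float *c = &C[(tx * TM + m) * N + ty * TN + n];
            *c = alpha * acc[m][n] + beta * *c;
        }
}
// Launch: dim3 block(16, 16); dim3 grid(M / 64, N / 64);
```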
Combine 4 of the 64x64 tiles into one 128x128 tile: 1 block computes a local C tile (128x128) and loads local tiles A (128x8) and B (8x128) per step; 1 thread computes an 8x8 local tile.
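The same register-tiling sketch carries over; only the (illustrative) tile constants change, assuming M, N are multiples of 128 and K a multiple of 8:

```cuda
constexpr int BM = 128, BN = 128, BK = 8, TM = 8, TN = 8;
// Still 16x16 = 256 threads per block; each thread now holds an 8x8 = 64-float
// accumulator in registers and stages 128*8 / 256 = 4 elements of A and of B
// per BK step (one float4 each if the loads are vectorized as in the step above).
// Launch: dim3 block(16, 16); dim3 grid(M / 128, N / 128);
```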