
Make better use of VPM #113

Open
doe300 opened this issue Jul 24, 2018 · 3 comments
Labels
enhancement, help wanted, optimization (related to an optimization step)

Comments


doe300 commented Jul 24, 2018

Up to 2 VPM writes can be queued in the VPM write FIFO (QPU -> VPM); a write blocks when the FIFO is full.
-> No need to stall/delay between VPM writes, as is currently done
-> This information could be used to insert non-VPM accesses between pairs of VPM writes (e.g. write vpm; write vpm; something else to prevent a stall; write vpm; ...)

Up to 2 VPM read setups can be queued in the VPM read FIFO (VPM -> QPU); further writes to the setup register are ignored, and VPM reads still outstanding when the program finishes are cancelled.
-> We could queue up to 2 read setups before waiting for the data to become available. Also, for loops, we could issue the read setup for the next iteration in advance; this requires draining the FIFO after the loop ends (to discard the data read for the one-past-last iteration). See the sketch below.
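
A minimal C sketch of this pipelined read pattern, assuming hypothetical vpm_* helpers as stand-ins for the read-setup and read instructions the compiler actually emits:

/* Sketch only: the vpm_* functions are hypothetical placeholders for the
 * VPM read setup (write to the setup register) and the VPM read itself. */
static void vpm_issue_read_setup(int row) { (void) row; }
static int vpm_read_row(void) { return 0; }
static void consume(int value) { (void) value; }

static void process_rows(int num_rows)
{
    if (num_rows <= 0)
        return;
    vpm_issue_read_setup(0);              /* prime the read FIFO before the loop */
    for (int i = 0; i < num_rows; ++i) {
        if (i + 1 < num_rows)
            vpm_issue_read_setup(i + 1);  /* queue the next setup early, the FIFO holds up to 2 */
        consume(vpm_read_row());          /* only blocks if the data is not available yet */
    }
    /* If the next setup were issued unconditionally (also for the one-past-last
     * iteration), the extra row would have to be read and discarded here. */
}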

DMA load/store operations cannot be queued, but DMA load and DMA store can run concurrently.

Is VPM access required to be synchronized between all QPUs?
The specification makes no statement for (or against) this. Is the VPM really shared (as in locking required), or is it "shared" but still usable by every QPU at once (like the TMU, no locking required)?

https://github.com/nineties/py-videocore uses a mutex to lock VPM access in its parallel examples, https://github.com/mn416/QPULib does not seem to use a mutex, https://github.com/maazl/vc4asm uses semaphores to lock VPM access.

Sources:
VideoCore IV Specification, pages 55+

doe300 added the enhancement, help wanted and optimization labels on Jul 24, 2018

doe300 commented Oct 6, 2018

Use the VPM as a real cache across all work-items:

Example code:

unsigned gid = get_global_id(0);
a = in[gid];
// do some operations
out[gid] = x;

We know (as the programmer):
All get_global_id(0) values within a work-group lie in a contiguous range (e.g. [17, 27]), therefore the memory accessed is consecutive. The same holds for get_local_id(0).

We could rewrite:

// for local id 0
<prefetch get_local_size(0) items from in + gid>
a = <read from VPM>;
// do some operations
<write into VPM> = x;
<wait for other work-items to have written everything into VPM>
<write back get_local_size(0) items to out + gid>

// for other work-items
<wait for local id 0 to have cached the values into VPM>
a = <read from VPM>;
// do some operations
<write into VPM> = x;
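
For illustration, an OpenCL-level analogue of this rewrite could look like the following. This is only a sketch: it uses __local memory as a stand-in for the VPM cache and assumes a maximum work-group size of 64, while the real transformation would insert the VPM/DMA accesses directly:

__kernel void example(__global const int* in, __global int* out)
{
    __local int cache[64];                 /* assumed maximum work-group size */
    const unsigned gid = get_global_id(0);
    const unsigned lid = get_local_id(0);

    /* work-item 0 prefetches the consecutive block for the whole group
     * (stands in for the single DMA load into the VPM) */
    if (lid == 0)
        for (unsigned i = 0; i < get_local_size(0); ++i)
            cache[i] = in[gid + i];
    barrier(CLK_LOCAL_MEM_FENCE);          /* wait for local id 0 to have cached the values */

    int a = cache[lid];
    // do some operations
    int x = a;
    cache[lid] = x;

    barrier(CLK_LOCAL_MEM_FENCE);          /* wait for all work-items to have written their result */
    /* work-item 0 writes the whole block back (stands in for the DMA store) */
    if (lid == 0)
        for (unsigned i = 0; i < get_local_size(0); ++i)
            out[gid + i] = cache[i];
}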

Outstanding issues:

  • Waiting for other work-items to have finished loading/writing the VPM
  • Detecting where this optimization can be performed
  • Making sure the VPM area is not reused in between


doe300 commented Nov 1, 2018

Some updates on this:

No mutex lock is required to access the VPM!
It turns out the VPM/DMA behaves more like a "normal" CPU cache than expected. The VPM/DMA configuration is per QPU, and therefore QPUs cannot overwrite the setups of other QPUs.
Summary: Other than the "normal" race conditions common to all hardware when writing the same addresses (VPM or RAM), VPM/DMA accesses can be executed completely in parallel (the hardware may stall some accesses until others have finished, but this is out of our control anyway).

Also, because of HW bug HW-2253, QPUs can only access the first 64 rows of the VPM (for user programs), but this does not apply to DMA (which can access 128 rows).

Within the next days I will post a PR which changes a few things:

  • Removes the mutex lock from most VPM/DMA accesses (except the scratch area)
  • Rewrites the scratch area to be per-QPU, so the mutex can be removed there too
  • Completely rewrites the mapping of memory access (again), to have more consistent checks for which memory can be lowered to where (register, VPM)
  • Adds work-group-shared caching of memory areas with an initial load and a final write-back (see the previous post)
  • Uses the "hidden" (as in not accessible from the QPUs) part of the VPM for RAM-to-RAM copies
  • Uses separate per-QPU areas of the "hidden" part of the VPM for RAM-to-RAM copies, removing the need for a mutex lock there (see the sketch below)
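
A minimal sketch of such a per-QPU partitioning of the DMA-only ("hidden") VPM rows; the row split is purely illustrative and not necessarily what the PR implements:

#include <stdio.h>

/* Illustrative only: split the DMA-only VPM rows 64..127 into disjoint
 * per-QPU areas for RAM-to-RAM copies, so no mutex is required. */
enum { HIDDEN_FIRST_ROW = 64, HIDDEN_LAST_ROW = 128, NUM_QPUS = 12 };

static unsigned copy_area_rows(void)
{
    return (HIDDEN_LAST_ROW - HIDDEN_FIRST_ROW) / NUM_QPUS; /* 5 rows per QPU, 4 rows unused */
}

static unsigned copy_area_base_row(unsigned qpu_id)
{
    return HIDDEN_FIRST_ROW + qpu_id * copy_area_rows();
}

int main(void)
{
    for (unsigned q = 0; q < NUM_QPUS; ++q)
        printf("QPU %2u -> VPM rows [%u, %u) for RAM-to-RAM copies\n",
               q, copy_area_base_row(q), copy_area_base_row(q) + copy_area_rows());
    return 0;
}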

Performance data will come when the changes are ready...

doe300 added a commit that referenced this issue Nov 3, 2018
* dumps layout of used VPM per kernel
* rewrites Emulator to handle VPM configuration per QPU
* fixes bug in elimination of bit operations
* fixes bug mapping IR operations to machine code
* fixes bug mapping volatile parameters to read-only parameters
* Emulator now tracks TMU reads per TMU

See #113
doe300 added a commit that referenced this issue Nov 3, 2018
Memory access is now mapped in the following steps:

* Determine the preferred and fall-back lowering type per memory area
* Check whether the lowering type can be applied, reserve resources
* Map all memory access to the specified lowering level

Also disables combining of VPM/DMA writes/reads for now.

See #113

Effects (test-emulator, last 2 commits):
Instructions: 45160 to 45779 (+1%)
Cycles:       659247 to 661193 (+0.2%)
Mutex waits:  282551 to 281459 (-0.3%)
doe300 added a commit that referenced this issue Nov 3, 2018
This change allows us to remove mutex locks from "direct" memory access.

See #113
doe300 added a commit that referenced this issue Nov 10, 2018
This version only combines writes of the same setup values, where possible.
The full version is also removed, since it will become obsolete anyway
with VPM-cached memory (see #113).

Effects (test-emulator):
Instructions:       52511 to 49793 (-5%)
Cycles:             644891 to 641680 (-0.5%)
Total time (in ms): 62869 to 58456 (-7%)

doe300 commented Nov 12, 2018

Linking #86, to be checked for performance/correctness before closing.
