
Make better use of VPM #113

Open
doe300 opened this issue Jul 24, 2018 · 3 comments
Labels
enhancement, help wanted, optimization (related to an optimization step)

Comments


doe300 commented Jul 24, 2018

Up to 2 VPM writes can be queued in the VPM write FIFO (QPU -> VPM); a write blocks when the FIFO is full.
-> No need to stall/delay between VPM writes, as is currently done
-> This information could be used to insert non-VPM accesses between pairs of VPM writes (e.g. write vpm; write vpm; something else to prevent a stall; write vpm; ...)

Up to 2 VPM read setups can be queued in the VPM read FIFO (VPM -> QPU); further writes to the setup register are ignored, and VPM reads still outstanding when the program finishes are cancelled.
-> We could queue up to 2 read setups before waiting for the data to become available. Also, for loops, we could issue the read setup for the next iteration in advance; this requires draining the FIFO after the loop ends (to discard the data read for the one-past-last iteration). See the sketch below.
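
A minimal C sketch of this pipelined read pattern, assuming hypothetical vpm_* helpers as stand-ins for the read-setup and read instructions the compiler actually emits:

/* Sketch only: the vpm_* functions are hypothetical placeholders for the
 * VPM read setup (write to the setup register) and the VPM read itself. */
static void vpm_issue_read_setup(int row) { (void) row; }
static int vpm_read_row(void) { return 0; }
static void consume(int value) { (void) value; }

static void process_rows(int num_rows)
{
    if (num_rows <= 0)
        return;
    vpm_issue_read_setup(0);              /* prime the read FIFO before the loop */
    for (int i = 0; i < num_rows; ++i) {
        if (i + 1 < num_rows)
            vpm_issue_read_setup(i + 1);  /* queue the next setup early, the FIFO holds up to 2 */
        consume(vpm_read_row());          /* only blocks if the data is not available yet */
    }
    /* If the next setup were issued unconditionally (also for the one-past-last
     * iteration), the extra row would have to be read and discarded here. */
}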

DMA load/store operations cannot be queued, but DMA load and DMA store can run concurrently.

Is VPM access required to be synchronized between all QPUs?
The specification makes no statement for (or against) this. Is the VPM really shared (as in locking required), or is it "shared" but still usable by every QPU at once (like the TMU, no locking required)?

https://github.com/nineties/py-videocore uses a mutex to lock VPM access in its parallel examples, https://github.com/mn416/QPULib does not seem to use a mutex, https://github.com/maazl/vc4asm uses semaphores to lock VPM access.

Sources:
VideoCore IV Specification, pages 55+

doe300 added the enhancement, help wanted and optimization labels on Jul 24, 2018

doe300 commented Oct 6, 2018

Use the VPM as a real cache across all work-items:

Example code:

unsigned gid = get_global_id(0);
a = in[gid];
// do some operations
out[gid] = x;

We know (as the programmer):
All get_global_id(0) values within a work-group lie in a contiguous range (e.g. [17, 27]), therefore the memory accessed is consecutive. The same holds for get_local_id(0).

We could rewrite:

// for local id 0
<prefetch get_local_size(0) items from in + gid>
a = <read from VPM>;
// do some operations
<write into VPM> = x;
<wait for other work-items to have written everything into VPM>
<write back get_local_size(0) items to out + gid>

// for other work-items
<wait for local id 0 to have cached the values into VPM>
a = <read from VPM>;
// do some operations
<write into VPM> = x;
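
For illustration, an OpenCL-level analogue of this rewrite could look like the following. This is only a sketch: it uses __local memory as a stand-in for the VPM cache and assumes a maximum work-group size of 64, while the real transformation would insert the VPM/DMA accesses directly:

__kernel void example(__global const int* in, __global int* out)
{
    __local int cache[64];                 /* assumed maximum work-group size */
    const unsigned gid = get_global_id(0);
    const unsigned lid = get_local_id(0);

    /* work-item 0 prefetches the consecutive block for the whole group
     * (stands in for the single DMA load into the VPM) */
    if (lid == 0)
        for (unsigned i = 0; i < get_local_size(0); ++i)
            cache[i] = in[gid + i];
    barrier(CLK_LOCAL_MEM_FENCE);          /* wait for local id 0 to have cached the values */

    int a = cache[lid];
    // do some operations
    int x = a;
    cache[lid] = x;

    barrier(CLK_LOCAL_MEM_FENCE);          /* wait for all work-items to have written their result */
    /* work-item 0 writes the whole block back (stands in for the DMA store) */
    if (lid == 0)
        for (unsigned i = 0; i < get_local_size(0); ++i)
            out[gid + i] = cache[i];
}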

Outstanding issues:

  • Waiting for other work-items to have finished loading/writing the VPM
  • Detecting where this optimization can be performed
  • Making sure the VPM area is not reused in between


doe300 commented Nov 1, 2018

Some updates on this:

No mutex lock is required to access the VPM!
It turns out the VPM/DMA behaves more like a "normal" CPU cache than expected. The VPM/DMA configuration is per QPU, and therefore QPUs cannot overwrite the setups of other QPUs.
Summary: Other than the "normal" race conditions common to all hardware when writing the same addresses (VPM or RAM), VPM/DMA accesses can be executed completely in parallel (the hardware may stall some accesses until others have finished, but this is out of our control anyway).

Also, because of HW bug HW-2253, QPUs can only access the first 64 rows of the VPM (for user programs), but this does not apply to DMA (which can access 128 rows).

Within the next days I will post a PR which changes a few things:

  • Removes the mutex lock from most VPM/DMA accesses (except the scratch area)
  • Rewrites the scratch area to be per-QPU, so the mutex can be removed there too
  • Completely rewrites the mapping of memory access (again), to have more consistent checks for which memory can be lowered to where (register, VPM)
  • Adds work-group-shared caching of memory areas with an initial load and a final write-back (see the previous post)
  • Uses the "hidden" (as in not accessible from the QPUs) part of the VPM for RAM-to-RAM copies
  • Uses separate per-QPU areas of the "hidden" part of the VPM for RAM-to-RAM copies, removing the need for a mutex lock there (see the sketch below)
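
A minimal sketch of such a per-QPU partitioning of the DMA-only ("hidden") VPM rows; the row split is purely illustrative and not necessarily what the PR implements:

#include <stdio.h>

/* Illustrative only: split the DMA-only VPM rows 64..127 into disjoint
 * per-QPU areas for RAM-to-RAM copies, so no mutex is required. */
enum { HIDDEN_FIRST_ROW = 64, HIDDEN_LAST_ROW = 128, NUM_QPUS = 12 };

static unsigned copy_area_rows(void)
{
    return (HIDDEN_LAST_ROW - HIDDEN_FIRST_ROW) / NUM_QPUS; /* 5 rows per QPU, 4 rows unused */
}

static unsigned copy_area_base_row(unsigned qpu_id)
{
    return HIDDEN_FIRST_ROW + qpu_id * copy_area_rows();
}

int main(void)
{
    for (unsigned q = 0; q < NUM_QPUS; ++q)
        printf("QPU %2u -> VPM rows [%u, %u) for RAM-to-RAM copies\n",
               q, copy_area_base_row(q), copy_area_base_row(q) + copy_area_rows());
    return 0;
}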

Performance data will come when the changes are ready...

doe300 added a commit that referenced this issue Nov 3, 2018
* dumps layout of used VPM per kernel
* rewrites Emulator to handle VPM configuration per QPU
* fixes bug in elimination of bit operations
* fixes bug mapping IR operations to machine code
* fixes bug mapping volatile parameters to read-only parameters
* Emulator now tracks TMU reads per TMU

See #113
doe300 added a commit that referenced this issue Nov 3, 2018
Memory access is now mapped in the following steps:

* Determine the preferred and fall-back lowering type per memory area
* Check whether the lowering type can be applied, reserve resources
* Map all memory access to the specified lowering level

Also disables combining of VPM/DMA writes/reads for now.

See #113

Effects (test-emulator, last 2 commits):
Instructions: 45160 to 45779 (+1%)
Cycles:       659247 to 661193 (+0.2%)
Mutex waits:  282551 to 281459 (-0.3%)
doe300 added a commit that referenced this issue Nov 3, 2018
This change allows us to remove mutex locks from "direct" memory access.

See #113
doe300 added a commit that referenced this issue Nov 10, 2018
This version only combines writes of the same setup values, where possible.
The full version is also removed, since it will become obsolete anyway
with VPM-cached memory (see #113).

Effects (test-emulator):
Instructions:       52511 to 49793 (-5%)
Cycles:             644891 to 641680 (-0.5%)
Total time (in ms): 62869 to 58456 (-7%)

doe300 commented Nov 12, 2018

Linking #86, to be checked for performance/correctness before closing.
