Make better use of VPM #113
Use the VPM as a real cache across all work-items. Example code:
We know (as programmers): We could rewrite:
Outstanding issues:
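A minimal illustrative sketch of the idea (the kernel and all names here are invented for this sketch, not the original example): every work-item reads the same memory area, so the compiler could DMA that area into the VPM once and serve all QPUs from there, instead of issuing the load per work-item.

```c
// Invented example (not the original snippet): every work-item sums the
// *same* row of `matrix`, so the row could be loaded into the VPM once
// and read by all QPUs from there, instead of one DMA load per work-item.
__kernel void row_sum(__global const float* matrix,
                      __global float* out,
                      const int rowLength)
{
    const int gid = get_global_id(0);
    float sum = 0.0f;
    for (int i = 0; i < rowLength; ++i)
        sum += matrix[i];    // identical access pattern in every work-item
    out[gid] = sum;
}
```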
Some updates on this: no mutex lock is required to access the VPM!
Within the next few days I will post a PR which changes a few things:
Performance data will come when the changes are ready...
* dumps layout of used VPM per kernel
* rewrites Emulator to handle VPM configuration per QPU
* fixes bug in elimination of bit operations
* fixes bug in mapping IR operations to machine code
* fixes bug mapping volatile parameters to read-only parameters
* Emulator now tracks TMU reads per TMU

See #113
Memory access is now mapped in the following steps (see the sketch below):
* Determine preferred and fall-back lowering type per memory area
* Check whether the lowering type can be applied, reserve resources
* Map all memory access to the specified lowering level

Also disables combining of VPM/DMA writes/reads for now. See #113

Effects (test-emulator, last 2 commits):
Instructions: 45160 to 45779 (+1%)
Cycles: 659247 to 661193 (+0.2%)
Mutex waits: 282551 to 281459 (-0.3%)
This change allows us to remove mutex locks from "direct" memory access. See #113
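For illustration, a rough C-style sketch of the first two lowering steps (all types and function names are invented and do not correspond to actual VC4C identifiers):

```c
#include <stddef.h>

/* Invented types/names, only to illustrate the lowering steps above. */
typedef enum {
    LOWER_TO_REGISTER, /* keep the whole area in a register */
    LOWER_TO_VPM,      /* place the area in the VPM, no DMA per access */
    LOWER_TO_RAM_DMA   /* fall back to RAM access via DMA */
} LoweringType;

typedef struct {
    size_t sizeInBytes;
    LoweringType preferred;
    LoweringType fallback;
} MemoryAreaPlan;

/* Step 1: determine preferred and fall-back lowering type per area. */
static MemoryAreaPlan planArea(size_t sizeInBytes, int isConstant)
{
    MemoryAreaPlan plan = {sizeInBytes, LOWER_TO_VPM, LOWER_TO_RAM_DMA};
    if (isConstant && sizeInBytes <= 64)
        plan.preferred = LOWER_TO_REGISTER; /* small constant area */
    return plan;
}

/* Step 2: check whether the preferred type can be applied and reserve
 * the resources (e.g. VPM rows); otherwise fall back. Step 3 would then
 * rewrite every access according to the type chosen here. */
static LoweringType reserveArea(const MemoryAreaPlan* plan, size_t* vpmBytesLeft)
{
    if (plan->preferred == LOWER_TO_VPM && *vpmBytesLeft >= plan->sizeInBytes) {
        *vpmBytesLeft -= plan->sizeInBytes; /* reserve the VPM space */
        return LOWER_TO_VPM;
    }
    if (plan->preferred == LOWER_TO_REGISTER)
        return LOWER_TO_REGISTER; /* no shared resource to reserve */
    return plan->fallback;
}
```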
This version will only combine writes of the same setup values, where possible. The full version is also removed, since it will become obsolete anyway with VPM-cached memory (see #113).

Effects (test-emulator):
Instructions: 52511 to 49793 (-5%)
Cycles: 644891 to 641680 (-0.5%)
Total time (in ms): 62869 to 58456 (-7%)
Linking #86 to be checked for performance/correctness and then closed.
Up to 2 VPM writes can be queued in the VPM write FIFO (QPU -> VPM); a write will block when the FIFO is full.
-> There is no need to stall/delay between VPM writes, as is currently done.
-> This information could be used to insert non-VPM access between pairs of VPM writes, e.g.:
write vpm; write vpm; something else to prevent stall; write vpm; ...

Up to 2 VPM read setups can be queued in the VPM read FIFO (VPM -> QPU); further writes to the setup register will be ignored, and outstanding VPM reads at program finish are cancelled.
-> We could queue up to 2 read setups before waiting for the data to become available. Also, for loops, we could issue the read setup for the next iteration in advance; this requires draining the data after the loop ends (to discard the data read for the one-after-last iteration), as in the sketch below.
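A hedged C sketch of this software pipelining (vpm_setup_read / vpm_read_data / process are invented stand-ins for the generated setup writes and data reads, not real APIs):

```c
/* Invented stand-ins for the generated VPM read instructions: */
static void vpm_setup_read(int i) { (void) i; /* write VPM read setup register */ }
static int  vpm_read_data(void)   { return 0; /* blocking read of VPM data */ }
static void process(int value)    { (void) value; }

/* Issue the read setup for the next iteration in advance; the read
 * FIFO holds up to 2 setups, so iteration i can already queue the
 * setup for iteration i+1 before consuming its own data. */
void pipelined_reads(int iterations)
{
    vpm_setup_read(0);            /* prime the FIFO for iteration 0 */
    for (int i = 0; i < iterations; ++i) {
        vpm_setup_read(i + 1);    /* queue next setup in advance */
        process(vpm_read_data()); /* consume the data for iteration i */
    }
    vpm_read_data();              /* drain the data fetched for the
                                     one-after-last iteration */
}
```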
DMA load/store operations cannot be queued, but a DMA load and a DMA store can run concurrently.
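Similarly, the load/store concurrency allows double buffering; a hedged sketch (again with invented helper names, assuming the VPM has room for two blocks and blocks >= 1):

```c
/* Invented stand-ins for the generated DMA instructions: */
static void dma_start_load(int block)  { (void) block; /* write DMA load setup */ }
static void dma_start_store(int block) { (void) block; /* write DMA store setup */ }
static void dma_wait_load(void)        { /* read DMA load wait register */ }
static void dma_wait_store(void)       { /* read DMA store wait register */ }
static void compute(int block)         { (void) block; /* QPU work on block in VPM */ }

/* Since a DMA load and a DMA store can run concurrently, the load of
 * block i+1 can overlap the store of block i (using two VPM buffers). */
void overlap_load_store(int blocks)
{
    dma_start_load(0);
    for (int i = 0; i < blocks; ++i) {
        dma_wait_load();           /* block i has arrived in the VPM */
        if (i + 1 < blocks)
            dma_start_load(i + 1); /* next load overlaps this store */
        compute(i);
        if (i > 0)
            dma_wait_store();      /* previous store must have finished */
        dma_start_store(i);
    }
    dma_wait_store();              /* drain the last store */
}
```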
Is VPM access required to be synchronized between all QPUs?
There is no statement in the specification for (or against) this. Is the VPM really shared (as in, locking is required), or is it "shared" but still usable by every QPU at once (like the TMU, where no locking is required)?
https://github.com/nineties/py-videocore uses a mutex to lock VPM access in its parallel examples, https://github.com/mn416/QPULib does not seem to use a mutex, and https://github.com/maazl/vc4asm uses semaphores to lock VPM access.
Sources:
VideoCore IV Specification, pages 55+