perf(memblock): opt memset pattern #3632

Open: happy-lx wants to merge 14 commits into base: master

@happy-lx (Collaborator) commented Sep 23, 2024

This PR optimizes handling of the MemSet memory access pattern and fixes some Sbuffer performance bugs. The specific changes are as follows:

  • MemSet detection now checks, in store commit order, that the address delta and the data hash of consecutive Store instructions stay constant and that the loadQueue is empty.
  • Add ASP (Accurate Store Prefetcher), a prefetcher that works only in MemSet mode. Its prefetch distance depends on the size of the store instructions. The prefetch granularity is currently 1 KB.
    • With some restrictions relaxed (constant data hash && empty loadQueue), this prefetcher could also prefetch stores under MemCpy and improve performance. To avoid negative side effects, it is only allowed to work under MemSet for now.
  • To improve Sbuffer utilization under MemSet, try to fill each Sbuffer entry completely before sending it to Dcache. This also lets Dcache send AcquirePerm instead of AcquireData downstream, which reduces bus bandwidth usage.
  • Fix the sbuffer enqueue performance problem: also set enq.ready when the incoming request can be merged.
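
The detection rule in the first bullet can be illustrated with a small behavioral model. This is only a sketch under assumptions: the class name, the `threshold` of 4 consecutive matches, and the way the hash is passed in are hypothetical and not taken from the PR, which implements the logic in hardware.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemSetDetector:
    """Software model of the rule: same address delta, same data hash,
    empty load queue, observed over consecutive committed stores."""
    threshold: int = 4                 # assumed value, not from the PR
    last_addr: Optional[int] = None
    last_delta: Optional[int] = None
    last_hash: Optional[int] = None
    streak: int = 0                    # consecutive stores that matched

    def commit_store(self, addr: int, data_hash: int,
                     load_queue_empty: bool) -> bool:
        """Feed one store in commit order; return True while MemSet is detected."""
        delta = None if self.last_addr is None else addr - self.last_addr
        same_delta = delta is not None and delta == self.last_delta
        same_hash = data_hash == self.last_hash
        if load_queue_empty and same_delta and same_hash:
            self.streak += 1
        else:
            self.streak = 0            # any mismatch restarts detection
        self.last_addr, self.last_delta, self.last_hash = addr, delta, data_hash
        return self.streak >= self.threshold
```

Feeding 8-byte stores at addresses 0, 8, 16, ... with an unchanged hash trips the detector once enough matching stores have committed; a store with a different hash or a non-empty load queue resets it.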

Previously, sbuffer was ready only when it had an empty entry.
When no entry was empty but a request from the store queue (sq) could be merged into an existing entry,
sbuffer still refused the request, so stores could not run at full throughput.
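
A minimal sketch of the old and new readiness conditions, modeled in software. The function names, the representation of entries as a list of cacheline tags (with `None` for an empty entry), and the 64-byte line size are assumptions made for illustration only.

```python
from typing import List, Optional

LINE_BYTES = 64  # assumed cacheline size


def line_tag(addr: int) -> int:
    """Cacheline index that an address falls into."""
    return addr // LINE_BYTES


def enq_ready_old(entry_tags: List[Optional[int]]) -> bool:
    """Old behavior: ready only if some entry is empty."""
    return any(tag is None for tag in entry_tags)


def enq_ready_new(entry_tags: List[Optional[int]], req_addr: int) -> bool:
    """New behavior: also ready if the request hits a line already buffered."""
    can_merge = any(tag == line_tag(req_addr)
                    for tag in entry_tags if tag is not None)
    return enq_ready_old(entry_tags) or can_merge
```

With a full buffer holding lines 0 to 3, a store to address 0x48 (line 1) is rejected by the old condition but accepted by the new one.
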
If a memset is detected, each newly allocated sbuffer entry waits 32 cycles
before it is written to the dcache. (Under memset with the write bandwidth saturated,
at least two sb instructions execute per cycle, writing 2 bytes,
so a 64-byte cacheline takes 32 cycles to fill.)
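
The 32-cycle figure follows from dividing the line size by the bytes written per cycle. The sketch below reproduces that arithmetic and models the added wait; the 64-byte cacheline, the function names, and the idea of tracking an entry's age in cycles are assumptions, and any other drain conditions in the real sbuffer are not modeled.

```python
def cycles_to_fill(line_bytes: int = 64, bytes_per_cycle: int = 2) -> int:
    """Cycles needed to write a whole line (ceiling division)."""
    return -(-line_bytes // bytes_per_cycle)


def may_drain(age_cycles: int, memset_detected: bool,
              wait_cycles: int = 32) -> bool:
    """Hold an entry allocated under MemSet until it has waited long enough."""
    return (not memset_detected) or age_cycles >= wait_cycles
```

Wider stores would shorten the fill time, for example 8 bytes per cycle fills the same line in 8 cycles.
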

This improves sbuffer utilization and only takes effect in the MemSet pattern.
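
The benefit of a completely filled entry can be sketched with a per-line byte mask: once every byte is valid, the old line contents are not needed, so permission alone is enough. The helper names and the 64-byte line are hypothetical, and the returned strings simply reuse the request names from the description above.

```python
LINE_BYTES = 64                        # assumed cacheline size
FULL_MASK = (1 << LINE_BYTES) - 1      # one valid bit per byte


def merge_store(mask: int, offset: int, size: int) -> int:
    """Mark `size` bytes starting at `offset` as valid in the entry's mask."""
    if offset < 0 or size <= 0 or offset + size > LINE_BYTES:
        raise ValueError("store does not fit in one cacheline")
    return mask | (((1 << size) - 1) << offset)


def acquire_kind(mask: int) -> str:
    """A fully written line needs only permission; a partial one needs data."""
    return "AcquirePerm" if mask == FULL_MASK else "AcquireData"
```

Eight 8-byte stores at offsets 0, 8, ..., 56 fill the mask, at which point the sketch selects the permission-only request.
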
@linjuanZ added the "do not merge" label Sep 24, 2024