nvmeof/NVMeofGwMonitorClient: use a separate mutex for beacons #59053

Merged
merged 1 commit into from
Aug 25, 2024

@baum baum commented Aug 6, 2024

nvmeof/NVMeofGwMonitorClient: use a separate mutex for beacons

Add beacon_lock to mitigate potential beacon delays caused by slow message handling, particularly in handle_nvmeof_gw_map.
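A minimal sketch of the idea, not the actual patch: `GwClientSketch`, `send_beacon`, the counters, and `beacons_sent_during_slow_map` are hypothetical names invented for illustration, and `std::mutex` stands in for `ceph::mutex`. It assumes the beacon path needs only state guarded by `beacon_lock`. To keep the example deterministic, the beacons are sent inline from inside the slow handler rather than from a separate tick thread.

```cpp
#include <cassert>
#include <functional>
#include <mutex>

// Hypothetical stand-in for NVMeofGwMonitorClient. Only beacon_lock and
// handle_nvmeof_gw_map are names from the PR; everything else is illustrative.
class GwClientSketch {
  std::mutex lock;         // general lock held while handling messages
  std::mutex beacon_lock;  // separate lock guarding only beacon state
  int beacons_sent = 0;    // guarded by beacon_lock
  int maps_handled = 0;    // guarded by lock

public:
  // Message path: holds `lock` for the whole (possibly slow) handler.
  void handle_nvmeof_gw_map(const std::function<void()>& slow_work) {
    std::lock_guard<std::mutex> l(lock);
    slow_work();
    ++maps_handled;
  }

  // Beacon path: takes only beacon_lock, so it never waits on `lock`.
  void send_beacon() {
    std::lock_guard<std::mutex> l(beacon_lock);
    ++beacons_sent;
  }

  int beacons() {
    std::lock_guard<std::mutex> l(beacon_lock);
    return beacons_sent;
  }

  int maps() {
    std::lock_guard<std::mutex> l(lock);
    return maps_handled;
  }
};

// Sends `n` beacons while the map handler still holds `lock` and returns how
// many were recorded before the handler finished.
int beacons_sent_during_slow_map(int n) {
  GwClientSketch gw;
  int seen_inside_handler = 0;
  gw.handle_nvmeof_gw_map([&] {
    for (int i = 0; i < n; ++i) gw.send_beacon();
    seen_inside_handler = gw.beacons();
  });
  assert(gw.maps() == 1);
  return seen_inside_handler;
}
```

If `send_beacon` took `lock` instead, the calls made while the handler is running would block on a mutex that is already held. With the separate lock they complete regardless of how long the handler takes.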

CI Run: ceph/ceph-nvmeof#784


@baum baum requested a review from athanatos August 14, 2024 06:34

athanatos commented Aug 15, 2024

I'm much less familiar with the NVMeofGwMonitorClient side, but this looks reasonable to me at a surface level. I do have some questions and comments below that are worth a look.

First, a boring nit: include the subdirectory in the commit message:
"NVMeofGwMonitorClient: use a separate mutex for beacons" -> "nvmeof/NVMeofGwMonitorClient: use a separate mutex for beacons"

Next, from a read of NVMeofGwMonitorClient.cc, I suppose you are worried that slow message handling is causing beacons to lag, specifically that handle_nvmeof_gw_map is taking a long time. Is that the case? If so, I suggest adding a comment on beacon_lock explaining that reasoning. Something like:

/// allow beacons to be sent independently of handle_nvmeof_gw_map
ceph::mutex beacon_lock = ceph::make_mutex("NVMeofGw::beacon_lock");

My larger concern is that it's not clear to me what exactly about handle_nvmeof_gw_map is taking a long time. At a guess, it's the (synchronous?) gRPC call to the reactor? If that call really is hanging long enough for the mon to notice the gap, is the GW really healthy? If the SPDK reactor becomes unhealthy enough that no gRPC calls go through, it seems like you'd really want the GW to be declared dead and failed over.

The above doesn't mean that this change is incorrect, as I could well be missing something. I just want to point out that you probably want to make sure that the arrival of a beacon at the monitor demonstrates that the GW can actually serve client IO, and not just that the tick thread is still running.
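One way to picture that concern is the hedged sketch below. It is not part of this PR and not an existing Ceph mechanism: `BeaconGate`, `record_probe_ok`, and `may_send_beacon` are hypothetical names. The idea is to send a beacon only if some data-path probe (for example, a gRPC call to the reactor) has succeeded recently, so that a running tick thread alone is not enough to appear alive.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical liveness gate; an illustration of the concern, not Ceph code.
class BeaconGate {
  Clock::duration max_age;     // how stale the last successful probe may be
  Clock::time_point last_ok{}; // time of the last successful data-path probe
  bool has_ok = false;         // false until the first probe succeeds

public:
  explicit BeaconGate(Clock::duration age) : max_age(age) {
    assert(max_age >= Clock::duration::zero());
  }

  // Called whenever a data-path probe completes successfully.
  void record_probe_ok(Clock::time_point now) {
    last_ok = now;
    has_ok = true;
  }

  // A beacon is allowed only if a probe succeeded within max_age of `now`.
  bool may_send_beacon(Clock::time_point now) const {
    return has_ok && (now - last_ok) <= max_age;
  }
};
```

With a gate like this, a gateway whose reactor stops answering would stop sending beacons once `max_age` elapses, even though the beacon path itself no longer waits on the message-handling lock.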


@athanatos athanatos left a comment

Probably worth adjusting the commit message and adding a short explanatory comment to the new mutex as I outline above, but this looks fine at a surface level.

It's worth taking a look at my comment immediately above, however, as I'm a bit worried that this will have the effect of allowing a non-functional GW to continue sending beacons (though I could very easily be missing some other mechanism that would prevent that).

@baum baum force-pushed the wip-baum-20240806-00 branch 2 times, most recently from 4ed4cdc to 0802f51 Compare August 15, 2024 13:03
@baum baum changed the title NVMeofGwMonitorClient: use a separate mutex for beacons nvmeof/NVMeofGwMonitorClient: use a separate mutex for beacons Aug 15, 2024

baum commented Aug 25, 2024

jenkins retest this please

Add beacon_lock to mitigate potential beacon delays caused by slow message
handling, particularly in handle_nvmeof_gw_map.

Signed-off-by: Alexander Indenbaum <[email protected]>
@baum baum merged commit 8ae876c into ceph:main Aug 25, 2024
9 of 11 checks passed
pritha-srivastava pushed a commit to pritha-srivastava/ceph that referenced this pull request Aug 28, 2024
nvmeof/NVMeofGwMonitorClient: use a separate mutex for beacons