monitor/ensure connection health for RabbitMQ message consumption #1531

Open
jmartin-sul opened this issue Oct 22, 2024 · 0 comments


NOTE: the solution for this will likely look very similar to the solution for sul-dlss/dor-services-app#5190, so whoever picks this ticket up should probably either grab that other one as well or mark it as blocked. We should then try to use the same solution in both codebases for consistency. This is also similar to sul-dlss/happy-heron#3629.

Today (see Slack conversation) @andrewjbtw noticed that pre-assembly had lost its ability to consume messages from RabbitMQ, so it could no longer detect successful item deposits. This meant that "the Preassembly UI makes it look like all recent jobs are still running even when they’re all completed."

I went to the pre-assembly stage and QA VMs and ran the rabbitmq:setup rake task, which I suspect will fix the issue, given what we saw with H2 and DSA last week (see the H2 and DSA issues linked at the top of this description). I've asked @andrewjbtw to re-test and confirm when he gets a chance.

We suspect that this was an artifact of the wave of VM reboots that were necessary to restore functionality after a storage outage hit most of our infrastructure on Thu 2024-10-10.

Two questions:

  1. Is there a way to proactively monitor the connection health? For the opposite direction in pres cat, we have an okcomputer check that monitors RabbitMQ connection health. This situation is slightly different, since we want to monitor the ability to receive messages rather than to send them. But pre-assembly is a web app, so whatever the check looks like, it could still be done via okcomputer if we want.
  2. Is there a way to proactively ensure connection health? Since we only seem to run into this once in a blue moon, possibly only after events like system reboots (we haven't tried to reproduce the issue yet), maybe this isn't worth the effort. One idea would be to run the rabbitmq:setup rake task on deploy, but the author of this ticket is unsure whether that might have undesirable side effects. For example, is that Rake task non-destructive? Could it drop in-flight messages if it were run while the connection was fine?
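
For question 1, one possible shape for a receive-side check is to ask the broker how many consumers are attached to the queue and fail when there are none. The sketch below is only an illustration under stated assumptions: `QueueConsumerCheck`, the `min_consumers` option, and the queue name `example.queue` are hypothetical and not taken from pre-assembly. It assumes the queue object responds to `consumer_count`, as `Bunny::Queue` does, and takes the lookup as a block so the logic can be exercised without a live broker.

```ruby
# Sketch only: QueueConsumerCheck and the queue name used below are
# hypothetical; nothing here is taken from the pre-assembly codebase.
class QueueConsumerCheck
  Result = Struct.new(:ok, :message)

  # queue_lookup is a block that takes a queue name and returns an object
  # responding to #consumer_count (Bunny::Queue does).
  def initialize(queue_name, min_consumers: 1, &queue_lookup)
    @queue_name = queue_name
    @min_consumers = min_consumers
    @queue_lookup = queue_lookup
  end

  # Returns a Result; never raises, so a broken connection is reported as a
  # failed check rather than as an exception in the health endpoint.
  def call
    count = @queue_lookup.call(@queue_name).consumer_count
    if count >= @min_consumers
      Result.new(true, "#{@queue_name} has #{count} consumer(s)")
    else
      Result.new(false, "#{@queue_name} has #{count} consumer(s), expected at least #{@min_consumers}")
    end
  rescue StandardError => e
    Result.new(false, "could not inspect #{@queue_name}: #{e.class}: #{e.message}")
  end
end

# Possible wiring with Bunny (assumes an open Bunny channel named `channel`);
# a passive declare inspects the queue without creating or altering it:
#   check = QueueConsumerCheck.new('example.queue') do |name|
#     channel.queue(name, passive: true)
#   end
#   result = check.call
```

To expose this through okcomputer, a subclass of `OkComputer::Check` could run the check in its `check` method, pass the message to `mark_message`, and call `mark_failure` when the result is not ok. Note that a non-zero consumer count only shows that something is subscribed to the queue; it does not prove that messages are being processed successfully.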