You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
NOTE: the solution for this will likely look very similar to the solution for sul-dlss/happy-heron#3629, so whoever picks this ticket up should probably either grab that other one also, or mark that one as blocked. Then we should try to use the same solution in both codebases for consistency.
Yesterday (see Slack thread) I noticed that dor-services-app on QA and stage apparently lost its ability to consume messages from RabbitMQ, leading to loss of ability to detect reindexing requests and entries for the DSA event log. This caused the integration tests to fail, because they looked for a status of v2 Accessioned on test objects, but despite the workflows completing successfully, that didn't replace v2 Opened until a manual reindex was performed on the object. Additionally, despite preservationIngestWF completing, and despite observing that the pres cat workers ran replication jobs, cloud replication events didn't appear in the object event log.
Going to the DSA worker VM and running the rabbitmq:setup rake task fixed things. Indexing events and event log submissions were detected after that, and integration tests passed.
We suspect that this might've been an artifact of the wave of VM reboots that were necessary to restore functionality after a storage outage hit most of our infrastructure.
Two questions:
Is there a way to proactively monitor the connection health? For the opposite direction in pres cat, we have an okcomputer check that monitors RabbitMQ connection health. This situation is slightly different since we want to monitor the ability to receive messages, not to send them. But dor-services-app is a web app, so whatever the check looks like, it could still be done via okcomputer, if we want.
Is there a way to proactively ensure connection health? Since we only seem to run into this once in a blue moon, possibly only after events like system reboots (we haven't tried to reproduce the issue yet), maybe this isn't worth the effort. One idea would be to e.g. run the rabbitmq:setup rake task on deploy, but the author of this ticket is unsure if that might have undesirable side effects. E.g. is that Rake task non-destructive? Would it possibly drop in flight messages if that Rake task were run while the connection was fine?
The text was updated successfully, but these errors were encountered:
NOTE: the solution for this will likely look very similar to the solution for sul-dlss/happy-heron#3629, so whoever picks this ticket up should probably either grab that other one also, or mark that one as blocked. Then we should try to use the same solution in both codebases for consistency.
Yesterday (see Slack thread) I noticed that dor-services-app on QA and stage apparently lost its ability to consume messages from RabbitMQ, leading to loss of ability to detect reindexing requests and entries for the DSA event log. This caused the integration tests to fail, because they looked for a status of
v2 Accessioned
on test objects, but despite the workflows completing successfully, that didn't replacev2 Opened
until a manual reindex was performed on the object. Additionally, despitepreservationIngestWF
completing, and despite observing that the pres cat workers ran replication jobs, cloud replication events didn't appear in the object event log.Going to the DSA worker VM and running the
rabbitmq:setup
rake task fixed things. Indexing events and event log submissions were detected after that, and integration tests passed.We suspect that this might've been an artifact of the wave of VM reboots that were necessary to restore functionality after a storage outage hit most of our infrastructure.
Two questions:
rabbitmq:setup
rake task on deploy, but the author of this ticket is unsure if that might have undesirable side effects. E.g. is that Rake task non-destructive? Would it possibly drop in flight messages if that Rake task were run while the connection was fine?The text was updated successfully, but these errors were encountered: