SQS: when redelivering a message apply visibility timeout based on wait_time_seconds (long polling time) #2049

Open · pbudzon wants to merge 2 commits into main

Conversation


@pbudzon pbudzon commented Jul 1, 2024

Fix for: #1835
Partially also: #1819
Possibly related: celery/celery#8583

Issue
When not using short polling (wait_time_seconds > 0) and with task_acks_late=True, it is possible for a message to become stuck with the full visibility timeout when celery is shut down. I was able to replicate this consistently with cold shutdowns.

A similar issue happens with warm shutdowns (see the related discussions/bugs above). With task_acks_late=False, I believe this causes the message to be lost completely.
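For reference, a minimal configuration sketch matching the setup above (the broker URL, region, queue and task are placeholders, not taken from this PR):

```python
# Minimal sketch of a setup that exhibits the issue: long polling enabled
# (wait_time_seconds > 0) together with task_acks_late=True.
from celery import Celery

app = Celery("repro", broker="sqs://")  # AWS credentials come from the environment

app.conf.task_acks_late = True
app.conf.broker_transport_options = {
    "wait_time_seconds": 10,     # long polling; must be > 0 to hit this issue
    "visibility_timeout": 1800,  # default: 30 minutes
    "region": "eu-west-1",       # placeholder region
}

@app.task
def long_running(seconds=120):
    """Long enough to shut the worker down mid-task."""
    import time
    time.sleep(seconds)
```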

To reproduce

  1. Execute a long(ish) running job (long enough for you to perform the steps below).
  2. Send SIGQUIT to the celery master process.
  3. See Restoring X unacknowledged message(s) in the worker output.
  4. See that the message is still in-flight with the full visibility timeout.

Note: this is a race condition, so it doesn't happen every time, but in my tests it happened most of the time (roughly 80%).
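One way to check step 4 without the AWS console is to ask SQS for the number of in-flight (not visible) messages; the queue URL and region below are placeholders:

```python
# A restored-but-refetched message keeps counting as "not visible" until its
# full visibility timeout expires.
import boto3

sqs = boto3.client("sqs", region_name="eu-west-1")
attrs = sqs.get_queue_attributes(
    QueueUrl="https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue",
    AttributeNames=["ApproximateNumberOfMessagesNotVisible"],
)
print(attrs["Attributes"]["ApproximateNumberOfMessagesNotVisible"])
```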

What happens
I believe the internal details are described in #1819.
On the AWS side, you can see in the CloudTrail logs that the ChangeMessageVisibility call appears correctly, but right after it there is (at least one, sometimes more, depending on your concurrency) a ReceiveMessage call that fetches the message again from the queue, setting its visibility timeout back to a non-zero value (whatever your settings are).
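To make the race concrete, here is roughly the same sequence expressed with plain boto3 calls (not kombu's async client; the queue URL, region and receipt handle are placeholders):

```python
# Sketch of the race: restoring the message makes it visible, and a long poll
# that was already in flight re-fetches it and hides it again.
import boto3

sqs = boto3.client("sqs", region_name="eu-west-1")
queue_url = "https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue"
receipt_handle = "AQEB...placeholder"  # handle of the in-flight message

# 1. Shutdown path: the message is "restored" by making it visible right away.
sqs.change_message_visibility(
    QueueUrl=queue_url,
    ReceiptHandle=receipt_handle,
    VisibilityTimeout=0,
)

# 2. A ReceiveMessage long poll issued before shutdown is still waiting and now
#    returns that same message, hiding it again for the full visibility timeout.
response = sqs.receive_message(
    QueueUrl=queue_url,
    WaitTimeSeconds=10,      # long polling window
    VisibilityTimeout=1800,  # back to ~30 minutes in flight
)
```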

[Screenshot 2024-07-01 19:52: CloudTrail log entries showing the ChangeMessageVisibility call followed by ReceiveMessage calls]

This fix
This change sets the visibility timeout of the redelivered message to the wait_time_seconds value instead of 0 (zero). While this doesn't solve the underlying issue, it prevents the message from being re-fetched by the still-running ReceiveMessage call(s); see the sketch after the list below.
For most scenarios this should be a considerable improvement, because:

  1. With task_acks_late=True, the message will be hidden for wait_time_seconds (default 10 seconds) instead of the visibility timeout (default 30 minutes). I have tested the change in this scenario and it works as expected.
  2. With task_acks_late=False, this should prevent the message from being lost completely. I did not test that scenario, however.
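As a rough illustration of the idea (not the literal diff in kombu/transport/SQS.py), the restore path ends up doing something like this, with a hypothetical helper name:

```python
# Hypothetical helper: hide the restored message for wait_time_seconds instead
# of 0, so a lingering ReceiveMessage long poll cannot re-fetch it before the
# worker has fully exited.
def restore_message_visibility(sqs_client, queue_url, receipt_handle,
                               wait_time_seconds=10):
    # Previously the equivalent call used VisibilityTimeout=0, which made the
    # restored message immediately receivable again.
    sqs_client.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=receipt_handle,
        VisibilityTimeout=wait_time_seconds,
    )
```

The trade-off is that a legitimately restored message stays hidden for an extra wait_time_seconds, which is small compared to the default 30-minute visibility timeout.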

(Resolved review comment on kombu/transport/SQS.py)

pbudzon commented Jul 2, 2024

Issue also reported here: celery/celery#7283 back in 2022.

And celery/celery#8875

@pbudzon pbudzon changed the title SQS: when redelivering a message apply visibility timeout based on polling time SQS: when redelivering a message apply visibility timeout based on wait_time_seconds (long polling time) Jul 2, 2024
@pbudzon pbudzon requested a review from auvipy July 3, 2024 12:56

pbudzon commented Jul 5, 2024

Do you need anything else to merge this?


@auvipy auvipy left a comment


should we also consider updating related documents for this?

@Nusnus Nusnus self-requested a review August 18, 2024 11:19

Nusnus commented Aug 30, 2024

> @Nusnus Can you please take a look?

I haven’t fully reviewed this PR yet, but I offer an alternative approach at celery/celery#9213.

@pbudzon check it out, it might solve your original problem (still using 10s).


pbudzon commented Aug 30, 2024

@Nusnus I'll see if we can upgrade internally to celery 5.5 and if this new approach solves the problem.

The background for this specific issue, as far as I understood, is a race condition inside kombu: ReceiveMessage calls are made through an async library and happen "in the background" of everything else, and they are not properly (or quickly enough?) terminated when shutdown happens. As a result, the part of the code that restores messages to the queue on shutdown can run while there is still an active ReceiveMessage call "running", which then immediately fetches the restored message (removing it from the queue or hiding it again, depending on task_acks_late). Celery then exits, so the message is lost (or stays hidden for the full visibility timeout, again depending on task_acks_late).

That is my understanding based partially on what I saw in the code and on this: #1819

So, with the new shutdowns, I imagine this would still be an issue with the hard shutdown and the cold shutdown. For the others, I cannot say without testing. It would mainly depend on whether the async ReceiveMessage is killed fast enough, but that's controlled by kombu, not celery, so I'm not sure it's even possible to solve this from celery's side (but I don't know all the source code, so again, I can't say for sure).

What would you say is the best approach to terminating celery when we have time constraints (like in container services such as ECS, where we have 1-2 minutes to exit cleanly) and long-running tasks - so we are fairly sure the tasks won't be able to finish within the time limit anyway, but we want absolute certainty that they will end up back in the queue?
Right now we're using cold shutdowns and (apart from this bug, which we've fixed by applying this PR directly in our environment) they perform as expected, returning the messages to the queue correctly.


Nusnus commented Aug 30, 2024

@pbudzon

> @Nusnus I'll see if we can upgrade internally to celery 5.5 and if this new approach solves the problem.

🔥

> The background for this specific issue, as far as I understood, is a race condition inside kombu: ReceiveMessage calls are made through an async library and happen "in the background" of everything else, and they are not properly (or quickly enough?) terminated when shutdown happens. As a result, the part of the code that restores messages to the queue on shutdown can run while there is still an active ReceiveMessage call "running", which then immediately fetches the restored message (removing it from the queue or hiding it again, depending on task_acks_late). Celery then exits, so the message is lost (or stays hidden for the full visibility timeout, again depending on task_acks_late).
>
> That is my understanding based partially on what I saw in the code and on this: #1819

Basically yeah, the worker terminates too fast. It is very confusing during debugging because the debugger keeps the worker process alive, but I am able to replicate it using the smoke tests, so I can test it automatically.

> So, with the new shutdowns, I imagine this would still be an issue with the hard shutdown and the cold shutdown. For the others, I cannot say without testing. It would mainly depend on whether the async ReceiveMessage is killed fast enough, but that's controlled by kombu, not celery, so I'm not sure it's even possible to solve this from celery's side (but I don't know all the source code, so again, I can't say for sure).

> What would you say is the best approach to terminating celery when we have time constraints (like in container services such as ECS, where we have 1-2 minutes to exit cleanly) and long-running tasks - so we are fairly sure the tasks won't be able to finish within the time limit anyway, but we want absolute certainty that they will end up back in the queue? Right now we're using cold shutdowns and (apart from this bug, which we've fixed by applying this PR directly in our environment) they perform as expected, returning the messages to the queue correctly.

The approach with the soft shutdown is to allow the user to define a timer before everything collapses; 10-30 seconds should be enough. It gives a “limited warm shutdown” of 10-30s (for example), so the tasks can run a bit longer. Another SIGQUIT will stop the tasks but still wait for the timeout to expire, so the race condition should be “avoided”.

The hard shutdown is never used directly. It is the result of multiple SIGQUIT signals (or SIGINT or SIGTERM if REMAP_SIGTERM is configured). But yes, it will do the same as now, forcing termination of the worker process immediately.
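If it helps, enabling that window is a one-line setting in 5.5 (the option name worker_soft_shutdown_timeout is the new 5.5 setting; double-check it against the docs for the exact version you end up on):

```python
# Sketch: give the worker a 30-second "limited warm shutdown" window before a
# cold/hard shutdown kicks in (Celery >= 5.5).
from celery import Celery

app = Celery("worker", broker="sqs://")
app.conf.worker_soft_shutdown_timeout = 30  # seconds
```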


Nusnus commented Sep 8, 2024

> @Nusnus I'll see if we can upgrade internally to celery 5.5 and if this new approach solves the problem.

Celery v5.5.0b3 released.


@Nusnus Nusnus left a comment


Hey there @pbudzon, have you tried the v5.5 beta?
Do you still need the changes in this PR, or does the soft shutdown fix your use case?

Thanks!


pbudzon commented Oct 7, 2024

@Nusnus apologies, we have not had a chance to test 5.5 yet. I'm hoping we'll be able to do that in the coming weeks - I'd like to keep this PR open until we can confirm the issue has been resolved, please.
