SQS: when redelivering a message apply visibility timeout based on wait_time_seconds (long polling time) #2049
Conversation
Issue also reported here: celery/celery#7283, back in 2022.
Do you need anything else to merge this?
Should we also consider updating the related documentation for this?
I haven't fully reviewed this PR yet, but I offer an alternative approach at celery/celery#9213. @pbudzon, check it out; it might solve your original problem (still using 10s).
@Nusnus I'll see if we can upgrade internally to celery 5.5 and whether this new approach solves the problem.

The background for this specific issue, as far as I understood it, was a race condition inside kombu: ReceiveMessage calls are made through an async library and happen "in the background" of everything else, and they are not properly (or quickly enough?) terminated when shutdown happens. So the code restoring messages to the queue on shutdown can run while there's still an active ReceiveMessage call "running", which then immediately fetches the restored message (removing it from the queue or hiding it again, depending on settings). That is my understanding, based partially on what I saw in the code and on this: #1819

So, with the new shutdowns, I imagine this would still be an issue with the hard shutdown and the cold shutdown. For the others, I cannot say without testing. That would mainly depend on whether the async ReceiveMessage is killed fast enough, but that's controlled by kombu, not celery, so I'm not sure it's even possible to solve this from celery's side (but I don't know all the source code, so again, I don't know for sure).

What would you say is the best approach to terminating celery when we have time constraints (as in container services like ECS, where we have 1-2 minutes to exit cleanly) and long-running tasks, so we are fairly sure the tasks won't be able to finish within the time limit anyway, but we want absolute certainty they will end up back in the queue?
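For anyone wanting to verify that last point from the AWS side, a minimal boto3 sketch (the region and queue URL are placeholders) that checks whether a restored message actually became visible again, or was re-fetched and hidden by a lingering ReceiveMessage call:

```python
# Sketch: inspect the queue right after the worker exits. Region and queue
# URL are placeholders.
import boto3

sqs = boto3.client("sqs", region_name="eu-west-1")
queue_url = "https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue"

attrs = sqs.get_queue_attributes(
    QueueUrl=queue_url,
    AttributeNames=[
        "ApproximateNumberOfMessages",            # visible: restore worked
        "ApproximateNumberOfMessagesNotVisible",  # in flight: re-fetched and hidden again
    ],
)["Attributes"]
print(attrs)
```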
🔥
Basically yeah, the worker terminates too fast. It is very confusing during debugging because the debugger keeps the worker process alive, but I was able to replicate it using the smoke tests, so I could test it automatically.

The approach with the soft shutdown is to allow the user to define a timer before everything collapses; 10-30 seconds should be enough. It means the worker gets a "limited warm shutdown" of 10-30s (for example), so the tasks will run a bit more.

The hard shutdown is never used directly. It is the result of multiple SIGQUIT signals (or SIGINT, or SIGTERM if REMAP_SIGTERM is configured). But yes, it will do the same as now, forcing termination of the worker process immediately.
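As a configuration sketch, assuming the `worker_soft_shutdown_timeout` setting from the Celery 5.5 soft shutdown work (check the 5.5 docs for the exact name and semantics):

```python
# Hedged sketch: opt into the soft shutdown grace period. The setting name
# worker_soft_shutdown_timeout is assumed from the Celery 5.5 soft shutdown work.
from celery import Celery

app = Celery("tasks", broker="sqs://")
app.conf.worker_soft_shutdown_timeout = 30  # seconds of "limited warm shutdown"
```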
Celery v5.5.0b3 released.
Hey there @pbudzon, have you tried the v5.5 beta?
Do you still need the changes in this PR, or does the soft shutdown fix your use case?
Thanks!
@Nusnus apologies, we have not had a chance to test 5.5 yet. I'm hoping we'll be able to do that in the coming weeks - I'd like to keep this PR open until we can confirm this issue has been resolved, please.
Fix for: #1835
Partially also: #1819
Possibly related: celery/celery#8583
Issue

When not using short polling (`wait_time_seconds > 0`) and `task_acks_late=True`, it is possible for a message to become stuck with the full visibility timeout when celery is shut down. I was able to replicate that consistently with cold shutdowns. A similar issue happens with warm shutdowns (see the related discussion/bugs above). And with `task_acks_late=False`, I believe this causes the message to become lost completely.

To reproduce

Send `SIGQUIT` to the celery master process; celery should log `Restoring X unacknowledged message(s)`.

Note: this is a race condition, so it doesn't happen every time, but in my tests it happened most of the time (around 80%).
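For reference, a configuration sketch matching the scenario above (the broker URL is illustrative; the option values mirror the defaults mentioned in this PR):

```python
# Illustrative config for the failing scenario: long polling plus late acks,
# so an unfinished task's message must be restored on shutdown.
from celery import Celery

app = Celery("tasks", broker="sqs://")
app.conf.task_acks_late = True
app.conf.broker_transport_options = {
    "wait_time_seconds": 10,     # long polling enabled (> 0); 10s is the default
    "visibility_timeout": 1800,  # 30 minutes, the default mentioned below
}
```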
What happens

I believe the internal details are described in #1819.

On the AWS side, you can see in the CloudTrail logs the `ChangeMessageVisibility` call appearing correctly, but right after it there is (at least one, sometimes more, depending on your `concurrency`) a `ReceiveMessage` call which fetches the message again from the queue, causing its visibility timeout to be set back to non-zero (whatever your settings are).
This fix

This change sets the visibility timeout of the redelivered message to the `wait_time_seconds` value instead of 0 (zero). While this doesn't solve the underlying issue, it prevents the message from being re-fetched by the still-running `ReceiveMessage` call(s). For most scenarios this should be a considerable improvement, because:

- With `task_acks_late=True`, the message will be hidden for `wait_time_seconds` (default 10 seconds) instead of the visibility timeout (default 30 minutes). I have tested the change in this scenario and it works as expected.
- With `task_acks_late=False`, this should prevent the message from being lost completely. I did not test that scenario, however.
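In boto3 terms, the idea of the fix is roughly the following (a simplified sketch, not the actual kombu diff; queue URL and receipt handle are placeholders):

```python
# Simplified sketch of the idea behind this PR (not the actual kombu change).
import boto3

sqs = boto3.client("sqs", region_name="eu-west-1")
queue_url = "https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue"  # placeholder
receipt_handle = "..."  # receipt handle of the message being restored (elided)
wait_time_seconds = 10  # the configured long-polling time

# Hiding the restored message for wait_time_seconds instead of 0 outlasts any
# still-pending ReceiveMessage long poll (which waits at most that long), so
# the dying worker cannot re-fetch the message.
sqs.change_message_visibility(
    QueueUrl=queue_url,
    ReceiptHandle=receipt_handle,
    VisibilityTimeout=wait_time_seconds,  # previously restored with 0
)
```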