
Job not logged in failed_jobs if timeout occurs within database transaction #49389

Closed
graemlourens opened this issue Dec 15, 2023 · 23 comments

@graemlourens

Laravel Version

10.37.3

PHP Version

8.2.7

Database Driver & Version

MariaDB 10.3.27

Description

We had a 'phantom' problem: jobs that failed were not being logged to the failed_jobs table. After some investigation we found that we were, by mistake, running a heavy process inside a database transaction, which led the job to time out.

After testing, we found that the issue only occurs if the job times out during an open database transaction.

I reported this in laravel/horizon, but it turns out it has nothing to do with Horizon; it seems to be a framework issue.

Steps To Reproduce

  • Create a job with a timeout of 5 seconds
  • In the job's handle method, sleep inside a database transaction:
    public function handle(): void
    {
        DB::transaction(function (): void {
            sleep(10);
        });
    }

We would expect a job in the failed_jobs table; however, this is not the case, and the JobFailed event is also not dispatched.
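For context, a fuller sketch of such a job (class name hypothetical; $timeout and $tries are the standard Laravel queue options, with the values discussed in this thread):

```php
<?php

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Support\Facades\DB;

// Hypothetical job reproducing the report: the sleep outlasts $timeout
// while a database transaction is still open.
class SleepInTransactionJob implements ShouldQueue
{
    use Queueable;

    public int $timeout = 5; // the worker kills the job after 5 seconds
    public int $tries = 1;   // a single attempt, so the failure should be final

    public function handle(): void
    {
        DB::transaction(function (): void {
            sleep(10); // exceeds $timeout inside the open transaction
        });
    }
}
```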

crynobone added a commit that referenced this issue Dec 19, 2023
Verify #49389

Signed-off-by: Mior Muhammad Zaki <[email protected]>
@crynobone
Member

Hi there,

I added the following tests to verify the issue and was unable to replicate the problem: jobs and failed_jobs were updated properly. We need to be able to replicate the issue in order to solve it.

@crynobone
Member

Hey there, thanks for reporting this issue.

We'll need more info and/or code to debug this further. Can you please create a repository with the command below, commit the code that reproduces the issue as one separate commit on the main/master branch and share the repository here?

Please make sure that you have the latest version of the Laravel installer in order to run this command. Please also make sure you have both Git & the GitHub CLI tool properly set up.

laravel new bug-report --github="--public"

Do not amend and create a separate commit with your custom changes. After you've posted the repository, we'll try to reproduce the issue.

Thanks!

@graemlourens
Author

@crynobone Thank you for the time invested.

As a next step, we'll try to reproduce this in a fresh Laravel project, to rule out anything specific to our environment, and get back to you asap.

taylorotwell pushed a commit that referenced this issue Dec 19, 2023
* [10.x] Test Improvements

Verify #49389

Signed-off-by: Mior Muhammad Zaki <[email protected]>

* Apply fixes from StyleCI

---------

Signed-off-by: Mior Muhammad Zaki <[email protected]>
Co-authored-by: StyleCI Bot <[email protected]>
@graemlourens
Author

@crynobone

We have successfully reproduced the issue in a fresh, standalone Laravel project.
Attached is a ZIP file with a Docker environment ready for testing.

freshLaravel.zip

Steps to reproduce

Execute the following commands (in the root directory):

  • docker-compose build

  • docker-compose up -d

  • docker-compose exec php bash

  • php artisan migrate

  • php artisan queue:work

  • open browser http://localhost

  • Two jobs are dispatched: one that times out inside a transaction and one without

  • After queue:work dies (because it was killed), you can go to http://localhost:8080 to verify that there is NO job logged in the failed_jobs table

  • You can run php artisan queue:work again to process the second job and verify that this one WILL be logged in the failed_jobs table

Thank you for your consideration & time

@crynobone
Member

Can you push the repository to GitHub as suggested above?

@crynobone
Member

@graemlourens It seems that you didn't specify https://laravel.com/docs/10.x/queues#max-job-attempts-and-timeout, and I don't see how the job would end up in failed_jobs without that option, as it would go back to the jobs table, ready for another attempt after each failure.

@graemlourens
Author

@crynobone We did not specify $tries because by default it's 1, so the setting seemed unnecessary, and we did set $timeout on the job properly.

However, just to rule out any error on our side, we added tries=1 on the job as well as on the worker, and it had no influence, as suspected.

One job is logged (the one without a transaction) and one job is NOT logged (the one with a transaction).

Both jobs are configured exactly the same; the only difference is that one has the sleep inside a transaction and the other does not.

I'm not sure how we can better demonstrate this bug. I strongly suspect that you did not execute the steps we provided to reproduce the error, or you would have been able to confirm this behaviour by the presence of one entry in the failed_jobs table.

If you reconsider and reopen this ticket, we'll be glad to push it as a repository, as you requested.

@graemlourens
Author

@crynobone we've had further success in this matter:

Strangely, the bug is fixed if you use 'Batchable' on the job.

The reason the job is then logged correctly lies in Illuminate\Queue\Jobs\Job. In the fail method there is a check on line 195:

if ($e instanceof TimeoutExceededException &&
    $commandName &&
    in_array(Batchable::class, class_uses_recursive($commandName))) {
    $batchRepository = $this->resolve(BatchRepository::class);

In this case you're running rollBack on the $batchRepository. Even if the job is not attached to a batch, this fixes the problem.

However, I don't think it's right for us to have to mark all jobs as batchable to resolve this issue. I still believe this should also be handled for jobs without batches.

How do you see this?
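A minimal sketch of the workaround described above: the job gains the Batchable trait even though it is never dispatched inside a batch, which makes the rollback branch in Job::fail() run on timeout. Class name hypothetical:

```php
<?php

namespace App\Jobs;

use Illuminate\Bus\Batchable;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Support\Facades\DB;

class SleepInTransactionJob implements ShouldQueue
{
    // The Batchable trait makes the TimeoutExceededException branch in
    // Illuminate\Queue\Jobs\Job::fail() roll back the open transaction
    // before the failed job is written to failed_jobs.
    use Batchable, Queueable;

    public int $timeout = 5;

    public function handle(): void
    {
        DB::transaction(function (): void {
            sleep(10);
        });
    }
}
```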

@crynobone
Member

Tries doesn't have a default of 1:

return $this->payload()['maxTries'] ?? null;

@graemlourens
Author

@crynobone OK; still, as mentioned, we set tries to 1 and it still happens.

Please also refer to my temporary workaround: with 'Batchable' on the job it works; without it, it does not.

How do you feel about this?

@crynobone
Member

I cannot replicate the issue, and your reproduction code doesn't show what you are claiming. This is why we advised pushing the repository to GitHub.

@graemlourens
Author

@crynobone Are you confirming that you downloaded our ZIP, ran the commands we provided, and did NOT see the outcome we described? I don't see how that is possible. I can provide a screen recording if you prefer.

Or are you telling me that you will only reproduce it when we have pushed it to repository on github?

@crynobone
Member

crynobone commented Dec 22, 2023

We'll need more info and/or code to debug this further. Can you please create a repository with the command below, commit the code that reproduces the issue as one separate commit on the main/master branch and share the repository here?

Having everything in a single commit allows us to completely eliminate any possibility outside the affected code, instead of having to verify every file inside the ZIP.

@graemlourens
Author

graemlourens commented Dec 22, 2023

@crynobone a coworker of mine has uploaded the repo as requested: https://github.com/GrzegorzMorgas/laravel-bug-report

We were a little confused about whether to commit everything as ONE commit, or to make a base commit with the base setup and then a second commit with the changes required to reproduce the issue. The instructions are a little ambiguous to us:

"commit the code ... as one separate commit"
but then
"Do not amend and create a separate commit with your custom changes"

Please let us know whether what we created is OK and you can reproduce the issue, or whether you would like the repository in a different form.

@driesvints
Member

@graemlourens use the command from here: #49389 (comment)

Then commit all custom changes separately.

@graemlourens
Author

@driesvints @crynobone Understood. We've done as requested:

https://github.com/GrzegorzMorgas/bug-report

Steps to reproduce

Execute the following commands (in the root directory of the folder):

  • docker-compose build

  • docker-compose up -d

  • docker-compose exec php bash

  • cp .env.example .env

  • composer install

  • php artisan migrate

  • php artisan queue:work

  • open browser http://localhost

  • Two jobs are dispatched: one that times out inside a transaction and one without

  • After queue:work dies (because it was killed), you can go to http://localhost:8080 to verify that there is NO job logged in the failed_jobs table

  • You can run php artisan queue:work again to process the second job and verify that this one WILL be logged in the failed_jobs table

Important find

If you use Batchable on the job, then the one timing out inside a transaction suddenly works as well. We have traced this back to

if ($e instanceof TimeoutExceededException &&
    $commandName &&
    in_array(Batchable::class, class_uses_recursive($commandName))) {
    $batchRepository = $this->resolve(BatchRepository::class);

in Illuminate\Queue\Jobs\Job.

Because there is a rollback there, it seems to work. However, we believe it should not be necessary to mark all jobs as Batchable to work around this bug.

Thank you for your consideration & time

@driesvints
Member

Thanks @graemlourens. Might be a while before we can dig into this because we're all heading into holidays.

@driesvints driesvints reopened this Dec 23, 2023
@driesvints driesvints added the bug label Dec 23, 2023
@gkmk

gkmk commented Dec 30, 2023

I have reviewed this issue, and it may not be an actual bug after all. This happens because of the async signal handler.
In src/Illuminate/Queue/Worker.php (line 167), registerTimeoutHandler kills the current process when the timeout is reached. At that point the framework tries to log the failed job in the database, but because a transaction is still open, the insert does not persist.

Here are two solutions:

  1. I recommend updating the Laravel docs to cover this case and using the public function failed(\Throwable $exception): void method inside your job to roll back the transaction. I tested this and it works fine (the failed job is logged).
  2. If this is still considered an actual bug, we would need to reset the database connection somewhere in the Worker's registerTimeoutHandler function. I don't think this should fall within the scope of the framework, though; it should be handled in the job's failed function instead (point 1).
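Option 1 might look roughly like this on the affected job (a sketch only; it rolls back any transaction still open when the timeout handler fires, using the framework's DB::transactionLevel() and DB::rollBack()):

```php
use Illuminate\Support\Facades\DB;

// Sketch of the failed() hook from option 1: if the timeout handler
// killed the job mid-transaction, roll the transaction(s) back so the
// insert into failed_jobs can persist.
public function failed(\Throwable $exception): void
{
    while (DB::transactionLevel() > 0) {
        DB::rollBack();
    }
}
```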

@graemlourens
Author

graemlourens commented Dec 30, 2023

@gkmk Thank you for your insight and suggestions.

The reason I would tend to shy away from option 1) is that behaviour would then differ between jobs that are batchable and ones that are not. That would be an unnecessary inconsistency in my opinion. (See my previous comment.)

On a more subjective note, we have > 100 jobs, most of which use transactions and do not need failed methods. Option 1) would add failed methods just to handle this specific case so as not to lose jobs, which seems like quite an overhead.

It will be interesting to see what the Laravel core developers have to say.

@driesvints driesvints assigned themsaid and unassigned themsaid Jan 1, 2024
@crynobone
Member

is because then behaviour would be different between jobs that are batchable, and ones that are not. This would be an unnecessary inconsistency in my opinion.

Yes, this is intended, because Batchable jobs offer catch and finally callbacks which need to be invoked within the same process, while normal queued jobs typically rely on n+1 attempts (n = number of retries) to surface the error.
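For illustration, the catch/finally callbacks referred to here are the ones registered when dispatching a batch via the standard Bus::batch() API (the job class name is hypothetical):

```php
use Illuminate\Bus\Batch;
use Illuminate\Support\Facades\Bus;
use Throwable;

// The catch/finally callbacks below must run in the same worker process
// that observed the failure, which is why batchable jobs are treated
// differently on timeout.
Bus::batch([new SleepInTransactionJob])
    ->catch(function (Batch $batch, Throwable $e) {
        // first job failure detected for the batch
    })
    ->finally(function (Batch $batch) {
        // batch finished executing
    })
    ->dispatch();
```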

would add failed methods just to handle this specific case to not lose jobs, which seems to be quite an overhead.

The documentation already covers this in detail:

[Screenshot of the timeout section of the queue documentation]

https://laravel.com/docs/10.x/queues#timeout

@graemlourens
Author

@crynobone @driesvints @themsaid We disagree with your assessment and rest our case. We hope there will be a fix for this inconsistency in the future, reducing the chance of losing jobs; that is crucial for relying on Laravel queues as a major component of larger applications.

For anybody else worried about this: our solution will most likely be to declare all jobs as Batchable, which seems to have no ill effects. Even when a job is not dispatched within a batch, the transaction is then handled correctly, so the job is logged properly in case of a timeout within a transaction.

@crynobone
Member

for anybody else worried about this: Our solution will most likely be to declare all Jobs as Batchable,

Read the documentation: the worker can be restarted using a process manager (which should be the default in any production application anyway), or use queue:listen (which uses more CPU, if you insist on not using a process manager).

@graemlourens
Author

graemlourens commented Jan 2, 2024

@crynobone We use Laravel Horizon, which handles that automatically. But please be aware: if the job has $tries=1 (which most of our jobs do), the job is NOT retried, NOT logged, and therefore lost.

Please correct me if I'm wrong. This is the whole point of this ticket: losing jobs.
