
Job not logged in failed_jobs if timeout occurs within database transaction #49389

Closed
graemlourens opened this issue Dec 15, 2023 · 23 comments

@graemlourens

Laravel Version

10.37.3

PHP Version

8.2.7

Database Driver & Version

MariaDB 10.3.27

Description

We had a 'phantom' problem: jobs that failed were not being logged to the failed_jobs table. After some investigation we found that we were, by mistake, running a heavy process inside a database transaction, which led the job to time out.

After testing, we found that the issue only occurs if the job times out during an open database transaction.

I reported this in laravel/horizon, but it turns out it has nothing to do with Horizon; it seems to be a framework issue.

Steps To Reproduce

  • Create a job with a timeout of 5 seconds
  • In the job's handle method, sleep inside a database transaction:
    public function handle(): void
    {
        DB::transaction(function (): void {
            sleep(10);
        });
    }

We would expect a job in the failed_jobs table; however, this is not the case, and the JobFailed event is also not dispatched.
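For context, a fuller sketch of such a job (class name hypothetical; $timeout and $tries are the standard Laravel queue options, with the values discussed in this thread):

```php
<?php

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Support\Facades\DB;

// Hypothetical job reproducing the report: the sleep outlasts $timeout
// while a database transaction is still open.
class SleepInTransactionJob implements ShouldQueue
{
    use Queueable;

    public int $timeout = 5; // the worker kills the job after 5 seconds
    public int $tries = 1;   // a single attempt, so the failure should be final

    public function handle(): void
    {
        DB::transaction(function (): void {
            sleep(10); // exceeds $timeout inside the open transaction
        });
    }
}
```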

crynobone added a commit that referenced this issue Dec 19, 2023
Verify #49389

Signed-off-by: Mior Muhammad Zaki <[email protected]>
@crynobone
Member

Hi there,

I added the following tests to verify the issue and was unable to replicate the problem: jobs and failed_jobs were updated properly. We need to be able to replicate the issue in order to solve it.

@crynobone
Member

Hey there, thanks for reporting this issue.

We'll need more info and/or code to debug this further. Can you please create a repository with the command below, commit the code that reproduces the issue as one separate commit on the main/master branch and share the repository here?

Please make sure that you have the latest version of the Laravel installer in order to run this command. Please also make sure you have both Git & the GitHub CLI tool properly set up.

laravel new bug-report --github="--public"

Do not amend and create a separate commit with your custom changes. After you've posted the repository, we'll try to reproduce the issue.

Thanks!

@graemlourens
Author

@crynobone Thank you for the time invested.

As a next step, we'll try to reproduce this in a fresh Laravel project, to rule out anything specific to our environment, and get back to you asap.

taylorotwell pushed a commit that referenced this issue Dec 19, 2023
* [10.x] Test Improvements

Verify #49389

Signed-off-by: Mior Muhammad Zaki <[email protected]>

* Apply fixes from StyleCI

---------

Signed-off-by: Mior Muhammad Zaki <[email protected]>
Co-authored-by: StyleCI Bot <[email protected]>
@graemlourens
Author

@crynobone

We have successfully reproduced the issue in a fresh, standalone Laravel project.
Attached is a ZIP file with a Docker environment ready for testing.

freshLaravel.zip

Steps to reproduce

Execute the following commands (in the root directory):

  • docker-compose build

  • docker-compose up -d

  • docker-compose exec php bash

  • php artisan migrate

  • php artisan queue:work

  • open browser http://localhost

  • Two jobs are dispatched: one that times out inside a transaction and one without

  • After queue:work dies (because it was killed), you can go to http://localhost:8080 to verify that there is NO job logged in the failed_jobs table

  • You can run php artisan queue:work again to process the second job and verify that this one WILL be logged in the failed_jobs table

Thank you for your consideration & time

@crynobone
Member

Can you push the repository to GitHub as suggested above?

@crynobone
Member

@graemlourens It seems that you didn't specify https://laravel.com/docs/10.x/queues#max-job-attempts-and-timeout, and I don't see how the job would end up in failed_jobs without that option, as it would go back to the jobs table, ready for another attempt after each failure.

@graemlourens
Author

@crynobone We did not specify $tries because by default it's 1, so the setting seemed unnecessary, and we did set $timeout on the job properly.

However, just to rule out any error on our side, we added tries=1 on the job as well as on the worker, and it had no influence, as suspected.

One job is logged (the one without a transaction) and one job is NOT logged (the one with a transaction).

Both jobs are configured exactly the same; the only difference is that one has the sleep inside a transaction and the other does not.

I'm not sure how we can better demonstrate this bug. I strongly suspect that you did not execute the steps we provided to reproduce the error, or you would have been able to confirm this behaviour by the presence of one entry in the failed_jobs table.

If you reconsider and reopen this ticket, we'll be glad to push it as a repository, as you requested.

@graemlourens
Author

@crynobone we've had further success in this matter:

Strangely, the bug is fixed if you use 'Batchable' on the job.

The reason the job is then logged correctly lies in Illuminate\Queue\Jobs\Job. In the fail method there is a check on line 195:

if ($e instanceof TimeoutExceededException &&
    $commandName &&
    in_array(Batchable::class, class_uses_recursive($commandName))) {
    $batchRepository = $this->resolve(BatchRepository::class);

In this case you're running rollBack on the $batchRepository. Even if the job is not attached to a batch, this fixes the problem.

However, I don't think it's right for us to have to mark all jobs as batchable to resolve this issue. I still believe this should also be handled for jobs without batches.

How do you see this?
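A minimal sketch of the workaround described above: the job gains the Batchable trait even though it is never dispatched inside a batch, which makes the rollback branch in Job::fail() run on timeout. Class name hypothetical:

```php
<?php

namespace App\Jobs;

use Illuminate\Bus\Batchable;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Support\Facades\DB;

class SleepInTransactionJob implements ShouldQueue
{
    // The Batchable trait makes the TimeoutExceededException branch in
    // Illuminate\Queue\Jobs\Job::fail() roll back the open transaction
    // before the failed job is written to failed_jobs.
    use Batchable, Queueable;

    public int $timeout = 5;

    public function handle(): void
    {
        DB::transaction(function (): void {
            sleep(10);
        });
    }
}
```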

@crynobone
Member

Tries doesn't have a default of 1:

return $this->payload()['maxTries'] ?? null;

@graemlourens
Author

@crynobone OK; still, as mentioned, we set tries to 1 and it still happens.

Please also refer to my temporary workaround: with 'Batchable' on the job it works; without it, it does not.

How do you feel about this?

@crynobone
Member

I cannot replicate the issue, and your reproduction code doesn't show what you are claiming. This is why we advised pushing the repository to GitHub.

@graemlourens
Author

@crynobone Are you confirming that you downloaded our ZIP, ran the commands we provided, and did NOT see the outcome we described? I don't see how that is possible. I can provide a screen recording if you prefer.

Or are you telling me that you will only reproduce it when we have pushed it to repository on github?

@crynobone
Member

crynobone commented Dec 22, 2023

We'll need more info and/or code to debug this further. Can you please create a repository with the command below, commit the code that reproduces the issue as one separate commit on the main/master branch and share the repository here?

Having everything in a single commit allows us to completely eliminate any possibility outside the affected code, instead of having to verify every file inside the ZIP.

@graemlourens
Author

graemlourens commented Dec 22, 2023

@crynobone a coworker of mine has uploaded the repo as requested: https://github.com/GrzegorzMorgas/laravel-bug-report

We were a little confused about whether to commit everything as ONE commit, or to make a base commit with the base setup and then a second commit with the changes required to reproduce the issue. The instructions are a little ambiguous to us:

"commit the code ... as one separate commit"
but then
"Do not amend and create a separate commit with your custom changes"

Please let us know whether what we created is OK and you can reproduce the issue, or whether you would like the repository in a different form.

@driesvints
Member

@graemlourens use the command from here: #49389 (comment)

Then commit all custom changes separately.

@graemlourens
Author

@driesvints @crynobone Understood. We've done as requested:

https://github.com/GrzegorzMorgas/bug-report

Steps to reproduce

Execute the following commands (in the root directory of the folder):

  • docker-compose build

  • docker-compose up -d

  • docker-compose exec php bash

  • cp .env.example .env

  • composer install

  • php artisan migrate

  • php artisan queue:work

  • open browser http://localhost

  • Two jobs are dispatched: one that times out inside a transaction and one without

  • After queue:work dies (because it was killed), you can go to http://localhost:8080 to verify that there is NO job logged in the failed_jobs table

  • You can run php artisan queue:work again to process the second job and verify that this one WILL be logged in the failed_jobs table

Important find

If you use Batchable on the job, then the one timing out inside a transaction suddenly works as well. We have traced this back to

if ($e instanceof TimeoutExceededException &&
    $commandName &&
    in_array(Batchable::class, class_uses_recursive($commandName))) {
    $batchRepository = $this->resolve(BatchRepository::class);

in Illuminate\Queue\Jobs\Job.

Because there is a rollback there, it seems to work. However, we believe it should not be necessary to mark all jobs as Batchable to work around this bug.

Thank you for your consideration & time

@driesvints
Member

Thanks @graemlourens. Might be a while before we can dig into this because we're all heading into holidays.

@driesvints driesvints reopened this Dec 23, 2023
@driesvints driesvints added the bug label Dec 23, 2023
@gkmk

gkmk commented Dec 30, 2023

I have reviewed this issue, and it may not be an actual bug after all. This happens because of the async signal handler.
In src/Illuminate/Queue/Worker.php (line 167), registerTimeoutHandler kills the current process when the timeout is reached. At that point the framework tries to log the failed job in the database, but because a transaction is still open, the insert does not persist.

Here are two solutions:

  1. I recommend updating the Laravel docs to cover this case and using the public function failed(\Throwable $exception): void method inside your job to roll back the transaction. I tested this and it works fine (the failed job is logged).
  2. If this is still considered an actual bug, we would need to reset the database connection somewhere in the Worker's registerTimeoutHandler function. I don't think this should fall within the scope of the framework, though; it should be handled in the job's failed function instead (point 1).
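Option 1 might look roughly like this on the affected job (a sketch only; it rolls back any transaction still open when the timeout handler fires, using the framework's DB::transactionLevel() and DB::rollBack()):

```php
use Illuminate\Support\Facades\DB;

// Sketch of the failed() hook from option 1: if the timeout handler
// killed the job mid-transaction, roll the transaction(s) back so the
// insert into failed_jobs can persist.
public function failed(\Throwable $exception): void
{
    while (DB::transactionLevel() > 0) {
        DB::rollBack();
    }
}
```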

@graemlourens
Author

graemlourens commented Dec 30, 2023

@gkmk Thank you for your insight and suggestions.

The reason I would tend to shy away from option 1) is that behaviour would then differ between jobs that are batchable and ones that are not. That would be an unnecessary inconsistency in my opinion. (See my previous comment.)

On a more subjective note, we have > 100 jobs, most of which use transactions and do not need failed methods. Option 1) would add failed methods just to handle this specific case so as not to lose jobs, which seems like quite an overhead.

It will be interesting to see what the Laravel core developers have to say.

@driesvints driesvints assigned themsaid and unassigned themsaid Jan 1, 2024
@crynobone
Member

is because then behaviour would be different between jobs that are batchable, and ones that are not. This would be an unnecessary inconsistency in my opinion.

Yes, this is intended, because Batchable jobs offer catch and finally callbacks which need to be invoked within the same process, while normal queued jobs typically rely on n+1 attempts (n = number of retries) to surface the error.
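For illustration, the catch/finally callbacks referred to here are the ones registered when dispatching a batch via the standard Bus::batch() API (the job class name is hypothetical):

```php
use Illuminate\Bus\Batch;
use Illuminate\Support\Facades\Bus;
use Throwable;

// The catch/finally callbacks below must run in the same worker process
// that observed the failure, which is why batchable jobs are treated
// differently on timeout.
Bus::batch([new SleepInTransactionJob])
    ->catch(function (Batch $batch, Throwable $e) {
        // first job failure detected for the batch
    })
    ->finally(function (Batch $batch) {
        // batch finished executing
    })
    ->dispatch();
```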

would add failed methods just to handle this specific case to not lose jobs, which seems to be quite an overhead.

The documentation already covers this in detail:

[Screenshot of the timeout section of the queue documentation]

https://laravel.com/docs/10.x/queues#timeout

@graemlourens
Author

@crynobone @driesvints @themsaid We disagree with your assessment and rest our case. We hope there will be a fix for this inconsistency in the future, reducing the chance of losing jobs; that is crucial for relying on Laravel queues as a major component of larger applications.

For anybody else worried about this: our solution will most likely be to declare all jobs as Batchable, which seems to have no ill effects. Even when a job is not dispatched within a batch, the transaction is then handled correctly, so the job is logged properly in case of a timeout within a transaction.

@crynobone
Member

for anybody else worried about this: Our solution will most likely be to declare all Jobs as Batchable,

Read the documentation: the worker can be restarted using a process manager (which should be the default in any production application anyway), or use queue:listen (which uses more CPU, if you insist on not using a process manager).

@graemlourens
Author

graemlourens commented Jan 2, 2024

@crynobone We use Laravel Horizon, which handles that automatically. But please be aware: if the job has $tries=1 (which most of our jobs do), the job is NOT retried, NOT logged, and therefore lost.

Please correct me if I'm wrong. This is the whole point of this ticket: losing jobs.
