Skip to content

Remove async-signal-unsafe code from signal handler #10275

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

TimeToogo
Copy link

@TimeToogo TimeToogo commented Apr 30, 2025

Fixes #10274


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@TimeToogo
Copy link
Author

Hey @edsiper, do you have any timeline for reviewing this PR?
This would help us decide if we need run our own fork of fluent.
Thanks,
Elliot

@edsiper
Copy link
Member

edsiper commented May 8, 2025

thanks for this contribution.

Instead of removing the calls, please provide an alternative solution where the functionality keeps the same.

@TimeToogo
Copy link
Author

Hey @edsiper,
I've updated the implementation to maintain the same functionality as before.

@leonardo-albertovich
Copy link
Collaborator

leonardo-albertovich commented May 9, 2025

Sorry, I clicked close accidentally.

@leonardo-albertovich
Copy link
Collaborator

I have a few reservations about this, maybe we can figure things out togther :

  1. I can't find the licensing terms for the code you included from Howard Hinnant
  2. I don't think "rolling our own" timestamp to date is a wise choice
  3. The removal of snprintf from the timestamp part of the log line generation seems pointless when shortly after that flb_dump or flb_utils_error can be invoked which end up using all of those.
  4. In the context of a signal handler we're not even supposed to access the TLS (which is done before by flb_signal_handler)
  5. Is it correct for flb_signal_handler to call flb_signal_init? I didn't check that but it caught my attention

According to the documentation we should use pthread_sigmask to ensure that we control which thread handles signals and in the signal handler itself we should only set a flag we can check once we resume the process.

Unless there is a very strong reason for this PR to be merged I think we should focus on fixing the logical error from the root instead, otherwise we could be making our own lives harder when we have to face this (and maintain the newly added code) in the future.

There's a PR from @pwhelan where he's dealing with something related so you might want to talk to him to ensure that you don't end up duplicating the effort and wasting your time.

@pwhelan
Copy link
Contributor

pwhelan commented May 9, 2025

There's a PR from @pwhelan where he's dealing with something related so you might want to talk to him to ensure that you don't end up duplicating the effort and wasting your time.

This pull request seems to be working on another aspect so it should not interfere.

@pwhelan
Copy link
Contributor

pwhelan commented May 9, 2025

I have a few reservations about this, maybe we can figure things out togther :

  1. I can't find the licensing terms for the code you included from Howard Hinnant

The code is from here, as far as I can tell: https://howardhinnant.github.io/date_algorithms.html#civil_from_days.

It seems to be derived from this code (same author as the blog where it comes from): https://github.com/HowardHinnant/date/blob/a45ea7c17b4a7f320e199b71436074bd624c9e15/include/date/date.h#L3087

Which seems to be under the MIT license: https://github.com/HowardHinnant/date/blob/master/LICENSE.txt.

Maybe not what I was being asked about but there it is.

@TimeToogo
Copy link
Author

TimeToogo commented May 10, 2025

@leonardo-albertovich I've updated the implementation to use a small buffer for storing signals and then moved the processing of them to the main thread. This is better than trying to shoehorn the functionality into async-safe code.
Because this is affecting customers of ours we are going to use our fork instead for now.

@leonardo-albertovich
Copy link
Collaborator

@TimeToogo would it be possible for you to share some information on how it's triggering in your deployments? That'd be very helpful.

@TimeToogo
Copy link
Author

TimeToogo commented May 12, 2025

@TimeToogo would it be possible for you to share some information on how it's triggering in your deployments? That'd be very helpful.

@leonardo-albertovich I put some details in #10274, we reload fluent-bit via the API to reload the config after it changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Deadlock during reload
4 participants