
partial cancel: transition partial cancel final error to warning #1309

Closed
trws opened this issue Oct 9, 2024 · 8 comments

Comments

@trws
Member

trws commented Oct 9, 2024

No description provided.

@garlick
Member

garlick commented Oct 9, 2024

I guess it's obvious, but reducing the log level below LOG_ERR means the message won't be seen on the user's stderr. The bug that was just found and fixed would not have been seen at all, since it didn't occur in the node-exclusive configured system instance, where logs are persistent.

So good idea/bad idea? Hmm.

@milroy
Member

milroy commented Oct 18, 2024

The PR I have out to fix issue #1284 and parts of the rzadams-related issues demotes the error to a warning, since it will be a common occurrence on clusters with ssd vertices that aren't attached to a broker rank: 6a3eecf

I can't think of a way to distinguish a state inconsistency that indicates an error from an inconsistency related to canceling brokerless vertices.

@milroy
Member

milroy commented Oct 18, 2024

@jameshcorbett pointed out that even LOG_WARNING will fill the logs on systems with rabbits or brokerless ssds. I think this is a compelling argument for making it a LOG_DEBUG.

@trws
Member Author

trws commented Oct 19, 2024 via email

@milroy
Member

milroy commented Oct 20, 2024

I should add that depending on the pruning filter settings an error/warning/debug may not get logged in this case.

In qmanager, that condition is met when the partial cancels didn't set full_removal to true. That can happen if there's a state mismatch between core and fluxion (a true error), or it can happen if brokerless resources are defined in the pruning filters (e.g., ALL:ssd,ALL:core). The pruning filter for ssd will cause the partial cancel logic to detect that not all filter resource spans have been removed (hence full_removal will be false).

However, if the pruning filter is set to the default (ALL:core) and there are brokerless ssds, they will get detected by the "is it an ssd" check, and the error/warning/debug condition won't be met in qmanager. You may wonder why having additional information in the pruning filter can result in additional work (i.e., the cleanup full cancel) and this logging ambiguity. It's because the cancel traversals are pre-order, so the pruning filter in the cluster vertex is checked first.

So I actually don't think the "is it an ssd" check helps disambiguate the cases.

@milroy
Member

milroy commented Oct 21, 2024

What I could do is add a bool output parameter to `cancel` to indicate whether brokerless vertices were detected. That would allow logging a warning/debug or an error depending on the bool value.

@milroy
Member

milroy commented Nov 5, 2024

@trws: can we close this issue? I think PR #1292 addresses it, as the final error condition should be encountered rarely and only as a result of a genuine error.

@trws
Member Author

trws commented Nov 5, 2024 via email

@trws trws closed this as completed Nov 9, 2024