
partial cancel: transition partial cancel final error to warning #1309

Closed
trws opened this issue Oct 9, 2024 · 8 comments

Comments

@trws
Member

trws commented Oct 9, 2024

No description provided.

@garlick
Member

garlick commented Oct 9, 2024

I guess it's obvious, but reducing the log level below LOG_ERR means the message won't be seen on the user's stderr. The bug that was just found and fixed would not have been seen at all, since it didn't occur in the node-exclusive configured system instance, where logs are persistent.

So good idea/bad idea? Hmm.

@milroy
Member

milroy commented Oct 18, 2024

The PR I have out to fix issue #1284 and parts of the rzadams-related issues demotes the error to a warning, since it will be a common occurrence on clusters with ssd vertices that aren't attached to a broker rank: 6a3eecf

I can't think of a way to distinguish a state inconsistency that indicates an error from an inconsistency related to canceling brokerless vertices.

@milroy
Member

milroy commented Oct 18, 2024

@jameshcorbett pointed out that even LOG_WARNING will fill the logs on systems with rabbits or brokerless ssds. I think this is a compelling argument for making it a LOG_DEBUG.

@trws
Member Author

trws commented Oct 19, 2024 via email

@milroy
Member

milroy commented Oct 20, 2024

I should add that depending on the pruning filter settings an error/warning/debug may not get logged in this case.

In qmanager, that condition is met when the partial cancels didn't set full_removal to true. That can happen if there's a state mismatch between core and fluxion (a true error), or it can happen if brokerless resources are defined in the pruning filters (e.g., ALL:ssd,ALL:core). The pruning filter for ssd will cause the partial cancel logic to detect that not all filter resource spans have been removed (hence full_removal will be false).

However, if the pruning filter is set to the default (ALL:core) and there are brokerless ssds, they will get detected by the "is it an ssd" check, and the error/warning/debug condition won't be met in qmanager. You may wonder why having additional information in the pruning filter can result in additional work (i.e., the cleanup full cancel) and this logging ambiguity. It's because the cancel traversals are pre-order, so the pruning filter in the cluster vertex is checked first.

So I actually don't think the "is it an ssd" check helps disambiguate the cases.

@milroy
Member

milroy commented Oct 21, 2024

What I could do is add a bool output parameter to `cancel` to indicate whether brokerless vertices were detected. That would allow logging a warning/debug or an error depending on the bool value.

@milroy
Member

milroy commented Nov 5, 2024

@trws: can we close this issue? I think PR #1292 addresses it, as the final error condition should be encountered rarely and only as a result of a genuine error.

@trws
Member Author

trws commented Nov 5, 2024 via email

@trws trws closed this as completed Nov 9, 2024