ReReco ACDCs failing to split Work #12188

Closed
hassan11196 opened this issue Nov 27, 2024 · 2 comments

@hassan11196

Impact of the bug
Global WorkQueue

Describe the bug
I was debugging ACDCs of ReReco wfs and, while going through the Global WorkQueue logs, found the following error:

Error: local variable 'rejectedWork' referenced before assignment
INFO:reqmgrInteraction:Trying to add more work for: cmsunified_ACDC3_Run2024E_JetMET0_PromptBTVJMENano_v2_241105_094856_6606
ERROR:reqmgrInteraction:Generic exception adding work to WQE inbox: {'Status': 'Running', 'TeamName': 'production', 'WMBSUrl': None, 'RequestName': 'cmsunified_ACDC3_Run2024E_JetMET0_PromptBTVJMENano_v2_241105_094856_6606', 'StartPolicy': {'OpenRunningTimeout': 2419200, 'policyName': 'ResubmitBlock', 'SliceType': 'NumberOfFiles', 'SplittingAlgo': 'ParentlessMergeBySize', 'SliceSize': 1}, 'EndPolicy': {'policyName': 'SingleShot'}, 'OpenForNewData': True, 'Inputs': {}, 'ProcessedInputs': ['/acdc/cmsunified_ACDC3_Run2024E_JetMET0_PromptBTVJMENano_v2_241105_094856_6606/:cmsunified_ACDC2_Run2024E_JetMET0_PromptBTVJMENano_v2_241010_214849_7327:DataProcessingMergeNANOEDMAODoutput/0/7'], 'RejectedInputs': [], 'PileupData': {}, 'ParentData': {}, 'ParentFlag': False, 'Jobs': 0, 'SiteWhitelist': [], 'SiteBlacklist': [], 'Dbs': None, 'ParentQueueId': None, 'Priority': 0, 'SubscriptionId': None, 'EventsWritten': 0, 'FilesProcessed': 0, 'PercentComplete': 100, 'PercentSuccess': 0, 'TaskName': None, 'ACDC': {}, 'ChildQueueUrl': None, 'ParentQueueUrl': None, 'NumberOfLumis': 0, 'NumberOfEvents': 0, 'NumberOfFiles': 0, 'NumOfFilesAdded': 0, 'Mask': None, 'TimestampFoundNewData': 1730803833, 'NoInputUpdate': False, 'NoPileupUpdate': False, 'CreationTime': 1730803833.574504, 'WMSpec': <WMCore.WMSpec.WMWorkload.WMWorkloadHelper object at 0x7fdf9c556f70>, 'Task': None}. Error: local variable 'rejectedWork' referenced before assignment

which is raised at the following line:
https://github.com/hassan11196/WMCore/blob/master/src/python/WMCore/WorkQueue/WorkQueue.py#L1096

I tried to reproduce the error for one wf, at least to see what goes on inside the loop, and noticed that the loop is exited before the rejectedWork variable is assigned.
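
To make the failure mode concrete, here is a minimal sketch of the same control flow in plain Python. FakePolicy and add_work are hypothetical stand-ins I made up for illustration; they are not the WMCore classes or the actual WorkQueue.py code, only the pattern "variable bound inside the loop, read after it".

```python
class FakePolicy:
    """Hypothetical stand-in for a start policy; NOT the WMCore class."""

    def __init__(self, supports_addition):
        self._supports = supports_addition

    def supportsWorkAddition(self):
        return self._supports

    def __call__(self):
        # Returns (units, rejectedWork), loosely mimicking a policy call.
        return ["unit"], []


def add_work(policies):
    """Binds rejectedWork only inside the loop, then reads it afterwards."""
    total = 0
    for policy in policies:
        if not policy.supportsWorkAddition():
            continue  # skips the assignment below
        units, rejectedWork = policy()
        total += len(units)
    # If every iteration hit `continue`, rejectedWork was never bound here.
    return total, rejectedWork


error = None
try:
    add_work([FakePolicy(False)])
except UnboundLocalError as exc:
    error = exc
    print(type(exc).__name__, exc)
```

With a policy that does support work addition, the same function returns normally, so the error only shows up when no iteration reaches the assignment.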

Now I need to understand:

  • whether the wfs are stuck because of this failure
  • why it only affects ReReco workflows

How to reproduce it
Steps to reproduce the behavior:
Run the piece of code further down against any of the wfs named in these log lines:

INFO:reqmgrInteraction:Added 0 new elements for request: cmsunified_ACDC1_Run2024G_Muon1_ZMu_PromptMUODPGNano_241105_094516_3263
INFO:reqmgrInteraction:Added 0 new elements for request: cmsunified_ACDC1_Run2024G_Muon1_ZMu_PromptMUODPGNano_241105_094536_6961
INFO:reqmgrInteraction:Added 0 new elements for request: cmsunified_ACDC2_Run2024F_Muon1_PromptMUOPOGNano_241105_094752_6353
INFO:reqmgrInteraction:Added 0 new elements for request: cmsunified_ACDC3_Run2024D_Muon0_PromptMUOPOGNano_241105_094735_2323
INFO:reqmgrInteraction:Added 0 new elements for request: cmsunified_ACDC3_Run2024E_EGamma0_PromptEGMNano_v2_241105_094838_4316
INFO:reqmgrInteraction:Added 0 new elements for request: cmsunified_ACDC3_Run2024E_JetMET0_PromptBTVJMENano_v2_241105_094856_6606
# Note: rmr (the object providing getSpec) and rucioObj (a Rucio instance) are
# assumed to already exist in the session; their setup is not shown here.
from WMCore.WMSpec.WMWorkload import WMWorkloadHelper, getWorkloadFromTask
from WMCore.WorkQueue.Policy.Start import startPolicy
from WMCore.Services.Rucio.Rucio import Rucio

## Fetch spec file
spec_file = rmr.getSpec('cmsunified_ACDC3_Run2024E_JetMET0_PromptBTVJMENano_v2_241105_094856_6606')
wh = WMWorkloadHelper(spec_file)
for topLevelTask in wh.taskIterator():
    spec = getWorkloadFromTask(topLevelTask)
    policyName = spec.startPolicy()
    print(policyName)
    policy = startPolicy(policyName, {'ResubmitBlock': {'args': {}, 'name': 'ResubmitBlock'}},
                         rucioObj=rucioObj)
    print(policy.supportsWorkAddition())
    # When this branch is taken, the loop moves on before rejectedWork is assigned
    if not policy.supportsWorkAddition() and True:
        print('continue - loops exit before rejectedWork is assigned')

Expected behavior
rejectedWork should be defined before the loop, so that it is always assigned when it is referenced afterwards.
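
A rough sketch of what I mean, again with hypothetical names (StubPolicy, add_work_fixed) rather than the real WorkQueue.py code. It only shows the "initialize before the loop" pattern; whether this is the right change for WMCore is a separate question.

```python
class StubPolicy:
    """Hypothetical policy stub used only for this illustration."""

    def __init__(self, supports_addition, rejected=None):
        self._supports = supports_addition
        self._rejected = rejected or []

    def supportsWorkAddition(self):
        return self._supports

    def __call__(self):
        # Returns (units, rejected inputs) for one task.
        return ["unit"], list(self._rejected)


def add_work_fixed(policies):
    """Same loop shape, but rejectedWork is bound before the loop starts."""
    total = 0
    rejectedWork = []  # always defined, even if every iteration is skipped
    for policy in policies:
        if not policy.supportsWorkAddition():
            continue
        units, rejected = policy()
        total += len(units)
        rejectedWork.extend(rejected)
    return total, rejectedWork


# No policy supports work addition: returns an empty result instead of raising.
print(add_work_fixed([StubPolicy(False)]))
```

Accumulating with extend (instead of overwriting on each iteration) is a design choice of this sketch, so that rejected inputs from several tasks are not lost.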

FYI @mapellidario

@mapellidario

I think I saw some conversation about this topic in recent days, but I cannot find it now. I remember Alan saying that yes, this error is not clear and we can improve the message, but that it means something is broken with the workflow itself and the cause of the error must be searched for elsewhere.

Moreover, this feels related to #11681

@amaltaro

@hassan11196 thank you for creating this ticket.
As Dario correctly pointed out, this ticket is a duplicate of #11681

Given that you provided more content and some useful information, I will link this ticket in the previous one and close this as a duplicate. Thank you again for taking the time to report this!

@amaltaro closed this as not planned on Dec 11, 2024