
Submission Deferral for HammerCloud does not work anymore #8852

Closed
belforte opened this issue Nov 29, 2024 · 3 comments

Comments

@belforte
Member

As reported by @stlammel [2].

See the documentation of the deferral in [1].

The problem is that PreJob looks for a non-zero value of CRAB_JobReleaseTimeout in the DAG bootstrap classAds.

The way for the user to indicate that was implemented ([1]) by adding this to the CRAB configuration:

config.Debug.extraJDL=['+CRAB_JobReleaseTimeout=Nsec'] 

But extraJDL can also contain things which make the DAG bootstrap fail, see #8784.

So I had to stop passing extraJDL ads to dagman.jdl, and now PreJob does not find the ad anymore.

We need a cleaner and explicit way to put CRAB_JobReleaseTimeout in DAGJob.jdl, which is written in submitDirect:

def submitDirect(self, schedd, cmd, arg, info):  # pylint: disable=R0201
    [...]
    with open('DAGJob.jdl', 'w', encoding='utf-8') as fd:
        print(jobJDL, file=fd)

The best way would be an additional parameter in the CRAB configuration, not an extraJDL classAd.

As a quick patch I could parse extraJDL from info in DagmanSubmitter and add CRAB_JobReleaseTimeout if needed :-( Still ugly.
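
For illustration only, a minimal sketch of that quick patch, assuming info carries the user's Debug.extraJDL list as '+Name=Value' strings (the helper name and the 'extra_jdl' key in info are assumptions, not the actual DagmanSubmitter code):

# Hypothetical helper: copy only the CRAB_JobReleaseTimeout ad from the
# user's extraJDL list into the DAG JDL text, right before DAGJob.jdl is
# written out in submitDirect().
def addJobReleaseTimeout(jobJDL, extraJdl):
    """Return jobJDL with +CRAB_JobReleaseTimeout appended, if the user set it."""
    for ad in extraJdl or []:
        name, _, value = ad.lstrip('+').partition('=')
        if name == 'CRAB_JobReleaseTimeout' and value:
            return jobJDL + '\n+CRAB_JobReleaseTimeout=%s' % value
    return jobJDL

# usage in submitDirect(), before writing the file (the key name 'extra_jdl'
# in info is an assumption for this sketch):
#   jobJDL = addJobReleaseTimeout(jobJDL, info.get('extra_jdl'))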

[2]

Subject: 	Re: no HammerCloud jobs at Tier-1s
Date: 	Fri, 29 Nov 2024 17:33:53 +0100
From: 	Stephan Lammel <[email protected]>
To: 	Stefano Belforte <[email protected]>, Andrea Sciaba <[email protected]>, Jakrapop Akaranee <[email protected]>, Malik Shahzad Muzaffar <[email protected]>
CC: 	Chan-Anun Rungphitakchai <[email protected]>, cms-service-crab3htcondor (Announcements from IT to CRAB3-HTCondor Project) <[email protected]>


Many Thanks Stefano!

There is a remaining issue that started Wednesday when first jobs
run again: Jobs seem to be batched together and are not finishing
continuously. If you look at siteStatus, i.e.
https://cmssst.web.cern.ch/siteStatus/summary.html
you see an hour and a half of unknown status/no HC results and
then 30 minutes of ok status/many HC results. (Although Site
Readiness is shown, the pattern comes from HammerCloud. If you
click on a site, for instance
https://cmssst.web.cern.ch/siteStatus/detail.html?site=T1_FR_CCIN2P3
you see it's coming from HammerCloud.)

Was there any change made to the "release a job every 5 minute"
logic that could account for this?
(I can't exclude it's due to scheduling on SI/global pool side but the
"Submit" times on the Grafana dashboard, i.e.

https://monit-grafana.cern.ch/d/cmsTMDetail/cms-task-monitoring-task-view?orgId=11&var-user=sciaba&var-task=241129_154348:sciaba_crab_HC-211-T1_FR_CCIN2P3-110047-20241129144101&from=1732895028000&to=now

suggests to me it's on CRAB side.)

@belforte
Member Author

belforte commented Dec 2, 2024

There are more extraJDL ads used by HammerCloud; here's one config file from a recent task:

from WMCore.Configuration import Configuration
config = Configuration()
config.section_('Site')
config.Site.blacklist = ['T3*']
config.Site.ignoreGlobalBlacklist = True
config.Site.whitelist = ['T1_FR_CCIN2P3']
config.Site.storageSite = 'T2_CH_CERN'
config.section_('Data')
config.Data.splitting = 'LumiBased'
config.Data.unitsPerJob = 15
config.Data.publication = False
config.Data.ignoreLocality = True
config.Data.outputDatasetTag = '0ea12bcd230936c2556840cb8452714d'
config.Data.inputDataset = '/GenericTTbar/HC-CMSSW_9_2_6_91X_mcRun1_realistic_v2-v2/AODSIM'
config.Data.allowNonValidInputDataset = True
config.Data.totalUnits = 3315
config.section_('JobType')
config.JobType.priority = 600000
config.JobType.allowUndistributedCMSSW = True
config.JobType.pluginName = 'Analysis'
config.JobType.maxJobRuntimeMin = 300
config.JobType.psetName = '/data/hc/apps/cms/inputfiles/usercode/miniaodsim_9xx.py'
config.section_('Debug')
config.Debug.extraJDL = ['+CRAB_NoWNStageout=1', '+CRAB_JobReleaseTimeout=300', '+CRAB_NoWNStageout=1', '+CRAB_HC=True', 'accounting_group=highprio', 'accounting_group_user=cmsdataops']
config.section_('User')
config.User.voRole = 'production'
config.section_('General')
config.General.transferLogs = False
config.General.requestName = 'HC-211-T1_FR_CCIN2P3-110047-20241129144101'
config.General.transferOutputs = False
config.General.activity = 'hctest'

@belforte
Member Author

belforte commented Dec 2, 2024

config.Debug.extraJDL ads are passed to the jobs anyhow via the Job.x.submit files.

CRAB_NoWNStageout is used in cmscp.py and PostJob. OK, but... could it be replaced with config.General.transferOutputs=False?
CRAB_HC is only used in RenewRemoteProxies to skip HC tasks. It could be replaced with if task['tm_activity'] and 'HC' in task['tm_activity'].upper(): as already used in DagmanCreator (see the sketch below).
accounting_group and accounting_group_user are JDL commands for the submitted jobs and are OK.
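
A minimal sketch of the activity-based check that could replace the CRAB_HC ad in RenewRemoteProxies (the helper name is hypothetical; the condition itself is the one quoted above from DagmanCreator):

# Hypothetical helper: decide whether a task is a HammerCloud test from the
# tm_activity column, instead of relying on the CRAB_HC classAd.
def isHammerCloudTask(task):
    """True if tm_activity is set and contains 'HC' (e.g. 'hctest')."""
    return bool(task['tm_activity']) and 'HC' in task['tm_activity'].upper()

# e.g. in RenewRemoteProxies, tasks for which isHammerCloudTask(task) is True
# would be skipped, as the CRAB_HC ad does today.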

@belforte
Member Author

belforte commented Dec 2, 2024

So only +CRAB_JobReleaseTimeout=300 needs to be passed to the DAG JDL, or it could be replaced with a configuration option, extending the JSON in tm_user_config as already done for input_blocks or require_accelerators.
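
A rough sketch of that option, assuming the value ends up as a new key in the tm_user_config JSON (both the function name and the key name are assumptions for illustration, not the actual implementation):

import json

# Client/server side (hypothetical): a new configuration parameter would be
# stored in the task's tm_user_config JSON together with the existing keys.
def addReleaseTimeoutToUserConfig(tmUserConfigJson, timeoutSeconds):
    """Store the submission-deferral timeout in the tm_user_config JSON string."""
    userConfig = json.loads(tmUserConfigJson) if tmUserConfigJson else {}
    userConfig['jobreleasetimeout'] = int(timeoutSeconds)  # key name is an assumption
    return json.dumps(userConfig)

# Scheduler side (hypothetical): DagmanSubmitter would read it back and add
# the classAd to DAGJob.jdl, so PreJob finds it again:
#   timeout = json.loads(task['tm_user_config']).get('jobreleasetimeout', 0)
#   if timeout:
#       jobJDL += '\n+CRAB_JobReleaseTimeout=%d' % timeout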

belforte added commits to belforte/CRABServer that referenced this issue Dec 2, 2024