submission is broken when asking for GPU #8784

Closed
belforte opened this issue Nov 12, 2024 · 7 comments · Fixed by #8796

Comments

@belforte
Member

belforte commented Nov 12, 2024

see https://cms-talk.web.cern.ch/t/crab-jobs-requesting-gpu-stay-idle-forever/61932/1

The problem is that the initial dag bootstrap job, which is submitted to the scheduler universe, itself requests one GPU.

Need to convert "Request_GPUs" to "CRAB_Request_GPUs".

belforte@vocms0199/~> condor_q 72591015 -af crab_reqname jobuniverse RequestGPUs RequiresGPU
241112_124253:alherrer_crab_gpu_test_job 7 1 1
belforte@vocms0199/~> condor_q 72591015 -l |grep "Requirements ="
Requirements = (true) && (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.GPUs >= RequestGPUs)
belforte@vocms0199/~> 

so the dag bootstrap job stays idle forever

@belforte
Member Author

info['accelerator_jdl'] = '+RequiresGPU=1\nrequest_GPUs=1'

@belforte
Member Author

note this
#6989 (comment)

maybe we need to keep Request_GPUs in the Job.submit file but make sure it does not go into the dag bootstrap submission
Not sure about RequiresGPU.
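
As an illustration of the idea above (keep the GPU request for the real jobs, drop it from the dag bootstrap submission), here is a minimal sketch. The helper name `strip_gpu_requests`, the `GPU_SUBMIT_KEYS` set and the example dictionaries are hypothetical and not taken from the CRAB code; whether `+RequiresGPU` should also be dropped is an assumption here, since the comment above leaves it open.

```python
# Hypothetical sketch: not the CRAB implementation.
# Submit-file keys are compared case-insensitively, so keys are lower-cased first.
GPU_SUBMIT_KEYS = {"request_gpus", "+requiresgpu"}  # assumption: both are dropped


def strip_gpu_requests(jdl):
    """Return a copy of a submit dictionary without the GPU-related keys."""
    return {k: v for k, v in jdl.items() if k.lower() not in GPU_SUBMIT_KEYS}


# Ads that would go into Job.submit for the actual jobs (illustrative values)
job_jdl = {
    "universe": "vanilla",
    "request_memory": "2000",
    "request_GPUs": "1",
    "+RequiresGPU": "1",
}

# The dag bootstrap runs in the scheduler universe and should not ask for a GPU
bootstrap_jdl = strip_gpu_requests({**job_jdl, "universe": "scheduler"})
print(bootstrap_jdl)
```

The job dictionary is left untouched, so the GPU request still reaches the jobs that need it.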

@belforte
Member Author

maybe all of this is useless

# extra_jdl and accelerator_jdl are not listed in SUBMIT_INFO
# and need ad-hoc handling, those are a string of `\n` separated k=v elements
if 'extra_jdl' in info and info['extra_jdl']:
    for keyValue in info['extra_jdl'].split('\n'):
        adName, adVal = keyValue.split(sep='=', maxsplit=1)
        # remove any user-inserted spaces which would break schedd.submit #8420
        adName = adName.strip()
        adVal = adVal.strip()
        jdl[adName] = adVal
if 'accelerator_jdl' in info and info['accelerator_jdl']:
    for keyValue in info['accelerator_jdl'].split('\n'):
        adName, adVal = keyValue.split(sep='=', maxsplit=1)
        # these are built in our code w/o extra spaces
        jdl[adName] = adVal
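
For reference, the two loops above do the same parsing and differ only in whether whitespace is stripped. A minimal sketch of a shared helper follows; the name `parse_jdl_string` and the `+MyAttr` example are hypothetical, and skipping lines without `=` is a choice made for this sketch, not behaviour of the code quoted above.

```python
# Hypothetical sketch: not the CRAB implementation.
def parse_jdl_string(text):
    """Parse a newline-separated string of k=v elements into a dict."""
    ads = {}
    for line in text.split("\n"):
        if "=" not in line:
            continue  # assumption: silently skip blank or malformed lines
        # split on the first '=' only, so values may themselves contain '='
        name, value = line.split("=", 1)
        ads[name.strip()] = value.strip()
    return ads


# The accelerator_jdl value quoted earlier in this issue
accelerator_ads = parse_jdl_string("+RequiresGPU=1\nrequest_GPUs=1")

# A user-style string with stray spaces and an empty line
extra_ads = parse_jdl_string("+MyAttr = 42\n\nrequest_cpus= 2")

print(accelerator_ads)
print(extra_ads)
```

Stripping in both cases is harmless for strings built without spaces and protects against user-inserted ones.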

Note this !

## These are the CRAB attributes that we want to add to the job class ad when
## using the submitDirect() method.
# SB: do we really need all of this ? Most of them are in Job.submit created by
# DagmanCreator and are not needed to submit/run the DagMan.
SUBMIT_INFO = [ \

Lack of cleanup strikes back :-(

@belforte
Member Author

for reference, here's the user's config file

config = config()

# General settings
config.General.requestName = 'gpu_test_job'
config.General.workArea = 'testcrabgpu_nov12_1'
config.General.transferOutputs = True
config.General.transferLogs = True

# JobType settings
config.JobType.pluginName = 'PrivateMC'
config.JobType.psetName = 'PSet.py'  
config.JobType.allowUndistributedCMSSW = True 
config.JobType.scriptExe = './run_job.sh'  # Shell script that runs the Python job
config.JobType.inputFiles = ['gpu_test.py', 'run_job.sh', 'FrameworkJobReport.xml']  # Include Python code and shell script

config.JobType.outputFiles = ['gpu_output.txt']  # Expected output file

config.JobType.maxMemoryMB = 2000 
config.JobType.maxJobRuntimeMin = 100  

config.Data.outputPrimaryDataset = 'GPU_Test_Dataset'
config.Data.splitting = 'EventBased'  # Splitting type for non-CMSSW jobs
config.Data.unitsPerJob = 1  
config.Data.totalUnits = 1  
#config.Data.outLFNDirBase = '/store/user/aherrera'  # Output directory for job results
config.Data.publication = False 
#config.Data.secondaryInputFiles = ['root://cmseos.fnal.gov//store/user/aherrera/JOBMERGED/ttboosted/ttboosted_01/tt_jj0p5.root']

# Site settings
config.section_("Site")
config.Site.storageSite = 'T3_US_FNALLPC'
#config.Site.whitelist = ['T2_US_Caltech', 'T2_US_Florida', 'T2_US_Purdue', 'T2_US_Wisconsin']
config.Site.requireAccelerator = True  # Specify supports GPUs

@belforte
Member Author

removing the lines indicated above made dag bootstrap run and submit jobs.
But my test submission is not getting matched in the global pool.

I have asked SI for help: https://mattermost.web.cern.ch/cms-o-and-c/pl/yi4eoususjgo8gg8k616qu6m9r

@belforte
Member Author

there is some special problem with KIT. Once I extended the list of possible sites, the job ran immediately at T2_US_Wisconsin.
The restriction to KIT was due to the currently dysfunctional JobRouter. I turned it off.

@belforte
Member Author

closed via #8796
