
add GPU support #6989

Closed
belforte opened this issue Jan 21, 2022 · 15 comments

@belforte
Member

While this may be covered in #6534, it is better to make sure that CRAB can be used with the same flexibility/constraints as production, i.e. aligned with the description in https://github.com/dmwm/WMCore/wiki/GPU-Support

@belforte belforte self-assigned this Jan 21, 2022
@belforte
Member Author

This is also a good item for a new person.

@novicecpp
Contributor

Last week, Katy, Dario, and I kicked off this task. After asking people around (thanks Nikos, Antonio, and Sam), this is what we know:

We need to specify at least 3 classads to match a GPU node:

  • RequiresGPU=1, to match a GPU node.
  • RequestGPUs=n, to match a node that has n GPUs.
  • DESIRED_Sites, a.k.a. Site.whitelist in the CRAB config. If the user does not specify a whitelist, CRAB will not put DESIRED_Sites in the classads.
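Put together, a minimal JDL sketch of those classads could look like this (the site list is illustrative, and whether a custom attribute needs the leading `+` depends on how CRAB builds the submit file, so treat this as a sketch only):

```
# Sketch only, not a complete submit file. In the job classads these appear
# as RequiresGPU, RequestGPUs, and DESIRED_Sites (site list is illustrative).
+RequiresGPU = 1
request_GPUs = 1
+DESIRED_Sites = "T2_CH_CERN,T2_US_Wisconsin"
```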

We have 2 CRAB configs to test submitting a GPU task with CRAB:

  • cudahelloworld from the Patatrack docs, built on lxplus-gpu; we ship the binary and run it with a custom script.
  • pset hltSimpleGPUConfig.py from Sam Harper.
    • Example CRAB task
    • CMSSW with this pset can run on both GPU and non-GPU nodes (confirmed with Sam).
      • On a GPU node, the CMSSW output in the TrigReport ---------- Path Summary ------------ section looks like this:
        == CMSSW: TrigReport ---------- Path   Summary ------------
        == CMSSW: TrigReport  Trig Bit#   Executed     Passed     Failed      Error Name
        == CMSSW: TrigReport     1    0        100          0        100          0 Status_OnGPU
        == CMSSW: TrigReport     1    1        100          0        100          0 DQM_EcalReconstruction_v3
        
      • On a non-GPU node, the output looks like this instead:
        == CMSSW: TrigReport ---------- Path   Summary ------------
        == CMSSW: TrigReport  Trig Bit#   Executed     Passed     Failed      Error Name
        == CMSSW: TrigReport     1    0        100        100          0          0 Status_OnGPU
        == CMSSW: TrigReport     1    1        100        100          0          0 DQM_EcalReconstruction_v3
        

Other reference:

@novicecpp
Contributor

In my opinion, users can use GPUs in 3 ways:

  • CMSSW built-in code (hltSimpleGPUConfig.py for example).
    • A RequiresGPU configuration option should be enough for CRAB users, and CRAB should provide the correct classads for them.
      • This makes the heavy assumption that every GPU code path can also run on CPU instead.
        • Then, what is the actual requirement for CMSSW to run on a GPU?
  • CMSSW custom plugin.
    • I do not know how to build a custom plugin; does it need to align its GPU requirements with main CMSSW? If yes, we can treat it like CMSSW built-in code.
  • Custom scripts: explicitly provide a CUDA binary and a custom script in the CRAB config.
    • This one (and maybe the CMSSW custom plugin too) is for advanced users; they should know what they are doing and supply the GPU requirements correctly. We only need documentation and an example for this.
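For the custom-scripts case, a crabConfig sketch could look like the following (file names and the site list are illustrative placeholders; only standard CRAB parameters are used, with the GPU requirement itself left to whatever accelerator support is eventually adopted):

```python
# Illustrative crabConfig fragment for shipping a pre-built CUDA binary
# together with a wrapper script. File names below are placeholders.
from CRABClient.UserUtilities import config

config = config()
config.JobType.pluginName = 'Analysis'
config.JobType.scriptExe = 'run_cuda.sh'        # wrapper invoked on the worker node
config.JobType.inputFiles = ['cudahelloworld']  # pre-built CUDA binary shipped with the job
config.Site.whitelist = ['T2_CH_CERN']          # restrict to sites known to provide GPUs
```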

@novicecpp
Contributor

novicecpp commented Sep 21, 2022

Next action items:

  • Discuss with CMSSW developers and users what the requirements of CMSSW and of custom plugins/binaries are when running on a GPU.
  • More tests:
    • cudahelloworld with an HC dataset, blacklisting the usual sites [ 'T2_CH_CERN', 'T2_US_Wisconsin' ].
    • Explore how -gencode works with the nvcc command; can we run a binary built with -gencode arch=compute_70,code=sm_70 on a node whose CUDACapability is other than 7.0?
    • Explore the optional(?) classads: CUDACapability, CUDARuntime, GPUMemoryMB.

@novicecpp
Contributor

novicecpp commented Sep 23, 2022

Summary of yesterday's tests:

  1. Whitelisted all GPU sites except T2_CH_CERN and T2_US_Wisconsin.
    1. Task 220922_152424:tseethon_crab_20220922_172420
    • Jobs were distributed to many sites as usual (the failed jobs came from sites with the stub-library error of item 2).
  2. On some nodes the CUDA driver is a stub library:
    1. T2_UK_London_IC (cap 8.0) job_out.6.0.txt
      • The job failed with == CMSSW: cudaErrorStubLibrary: CUDA driver is a stub library, the same log we get when running on a normal lxplus machine.
    2. T1_DE_KIT, Sam's task, node is cap 8.0, job_out.1.0.txt
  3. Tested nvcc -gencode with 7.0, 8.0, and 7.0+8.0.
    I compiled 3 versions of cudahelloworld (7.0, 8.0, and 7.0+8.0) and ran all the binaries against machines with arbitrary CUDACapability.
    1. T3_UK_London_QMUL (cap 8.0) job_out.1.0.txt
      • 7.0 failed with the error == CMSSW: cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device.
      • 8.0 ran fine.
      • 7.0+8.0 ran fine.
    2. T2_US_MIT (cap 7.0) job_out.4.0.txt
      • 7.0 ran fine.
      • 8.0 broke; the error logs are the same as for the jobs at T3_UK_London_QMUL.
      • 7.0+8.0 ran fine.
    3. T2_CH_CERN (cap 7.5) job_out.5.0.txt
      • 7.0 ran fine.
      • 8.0 broke; the error logs are the same as for the jobs at T3_UK_London_QMUL.
      • 7.0+8.0 ran fine.
  4. Explored the optional classads:
    1. CUDACapability="6.0,6.1,7.0" 220922_154933:tseethon_crab_20220922_174929
    2. CUDACapability="8.0" 220922_161815:tseethon_crab_20220922_181811
      • All jobs matched machines that do not have a cap-8.0 GPU.
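The -gencode pattern above is consistent with NVIDIA's binary-compatibility rule: SASS built for sm_XY runs only on devices with the same major capability and an equal or higher minor revision. A small Python sketch of that check (the helper names are mine, not CRAB or condor code):

```python
# Sketch of the cubin binary-compatibility rule suggested by the tests above:
# code compiled for sm_XY runs on a device of the same major capability
# with minor revision >= Y. (Helper names are illustrative, not CRAB code.)

def sass_compatible(code_cap: str, device_cap: str) -> bool:
    """True if a binary built with -gencode code=sm_<code_cap> can run
    on a device with CUDA capability <device_cap>."""
    c_major, c_minor = (int(x) for x in code_cap.split('.'))
    d_major, d_minor = (int(x) for x in device_cap.split('.'))
    return d_major == c_major and d_minor >= c_minor

def binary_runs(gencode_caps, device_cap):
    """A fat binary runs if any embedded SASS is compatible with the device."""
    return any(sass_compatible(c, device_cap) for c in gencode_caps)

# Reproduce the observations above:
assert binary_runs(['7.0'], '7.5')         # T2_CH_CERN (cap 7.5): 7.0 ran fine
assert not binary_runs(['8.0'], '7.5')     # 8.0 broke on cap 7.5
assert not binary_runs(['7.0'], '8.0')     # T3_UK_London_QMUL: 7.0 failed
assert binary_runs(['7.0', '8.0'], '8.0')  # fat binary ran everywhere
```

One caveat: nvcc can also embed PTX (code=compute_XY), which newer devices can JIT-compile; the failures observed here suggest only SASS was embedded in these binaries.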

@novicecpp
Contributor

novicecpp commented Oct 13, 2022

This week we had a discussion with SI and the GPU developers about supporting GPUs in CRAB.

Let me summarize all the info and agreements so far:

  • We provide the same GPUParams as WMAgent supports in https://github.com/dmwm/WMCore/wiki/GPU-Support
    • But we do not provide any default values. Users should know what they want to submit.
      • At least until we have more use cases, or problems that force us to enforce defaults.
    • GPU params suggestions from Andrea Bocci:
      • CUDACapability >= 6 as the only requirement set by default, allowing users to change it (if they are not using CMSSW, or if they rebuilt it for a different capability).
      • CUDARuntime should match the CUDA runtimes (maybe CMS_CUDA_SUPPORTED_RUNTIMES?) published by the sites.
      • Let users specify GPUMemory or GPUName, if they wish to.
      • CUDADriverVersion is useful in order to avoid driver bugs.
  • We will not guarantee that the params users provide will match any resources, due to technical issues:
    • It does not make sense to implement matchmaking in CRAB when it is already done by the condor system.
    • We could test whether the JDL will match any resources, but it is not possible for now: GPU info (e.g. CUDACapability) only becomes available after the pilot starts running.
      • Marco: condor_submit -dryrun takes a JDL and spits out the job classads as if it was submitted to the schedd. Then condor_q --better --jobads job --slotads sl 673.0 simulates the SECOND matchmaking.
      • More detail in the MM link above.
  • We check/enforce the siteWhitelist to GPU sites if requireAccelerator exists.
    • By querying the GPU sites from condor_status, as Marco suggested:
    condor_status -any -pool vocms0207.cern.ch -const 'mytype=="glidefactory" && stringlistmember("CMSGPU", GLIDEIN_Supported_VOs)' -af GLIDEIN_CMSSite | sort | uniq
    
  • Possibly add another periodic_remove rule: if request_GPUs exists, we will fail the job after 2 or 3 days instead of the usual 7 days. We will wait and see if this is needed.
    • How do we differentiate between "job never matched" and "job matched, but now no resources are left"?
      • Let me quote Stefano's chat:
        We do not at the moment. This event was so rare until now that there was no need. Let's wait what things will shape out like in the real world before we worry.
        
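A sketch of what such a periodic_remove rule could look like in the submit file (the 2-day cutoff is the proposal above; the exact expression CRAB would use may differ):

```
# Illustrative only: remove a still-idle GPU job after 2 days instead of 7.
# JobStatus == 1 means Idle; EnteredCurrentStatus is when it entered that state.
periodic_remove = (JobStatus == 1) && (RequestGPUs >= 1) && ((time() - EnteredCurrentStatus) > 2 * 24 * 60 * 60)
```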
  • If a user has a locally built CMSSW (possibly with modified GPU code), scram will build support for all CUDA devices that are supported by the release.
  • About the CUDA Runtime:
    • The CUDA software stack has 3 components:
      • the CUDA Runtime, libcudart.so (e.g. 10.x, 11.x),
      • the CUDA Driver, libcuda.so,
      • the kernel driver (which exposes the driver version number, e.g. 450.80.02).
    • IIUC, the pilot exposes the list of runtime versions that can run on the GPU via CMS_CUDA_SUPPORTED_RUNTIMES.
    • CMSSW ships with the CUDA Runtime and uses the kernel driver from the host; I am not sure about the CUDA Driver library.
    • For ref: https://docs.nvidia.com/deploy/cuda-compatibility/index.html

@belforte
Member Author

thanks Wa. I have edited a bit the part about periodic_remove, let me know if you notice something wrong.

@novicecpp
Contributor

novicecpp commented Oct 19, 2022

I got confirmation from Todor about the GPU params:

  • 3 GPU params (GPUMemoryMB, CUDACapabilities, CUDARuntime) do not have their JDL translation implemented in WMCore yet.
  • 3 mandatory params (GPUName, CUDADriverVersion, CUDARuntimeVersion) are put in the final JDL BUT are not used in matchmaking (yet). The DMWM team is waiting for SI to implement it (but the last time I asked Marco, SI plans to do it; let's see how it goes first).

So, what should we do? Should we wait for SI, or put the GPU params into the Requirements attribute ourselves?

Maybe tomorrow at the O&C week we will have more information from Antonio (SI team) and more use cases from Charis (our GPU user).

@amaltaro

3 GPU params (GPUMemoryMB, CUDACapabilities, CUDARuntime) is not implement JDL translation in WMCore yet.
3 mandatory params (GPUName, CUDADriverVersion, CUDARuntimeVersion) is put in final JDL BUT not use for in

It's actually the reverse: the supported parameters are GPUMemoryMB, CUDACapabilities, and CUDARuntime, while the other 3 have not yet been implemented.

@novicecpp
Contributor

Thanks Alan!

@belforte
Member Author

@novicecpp we should do like WMA does and keep in touch :-)

It sounds like everybody expects changes at some point, but nobody is sure what and when. Hey, new technology (at least for HEP): we are on the bleeding edge.
We do like WMA not only because we are friends with @amaltaro, nor only because we lack the intelligence and creativity to do better, but also (mostly?) to make life easier for SI by presenting them with a common set of requirements.

I still think that for long-term happiness we should let users specify the parameters they want to change via a (key, value) dictionary in crabConfig, rather than plan to add a named parameter for each possible new relevant requirement.
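The dictionary idea could map user-supplied (key, value) pairs straight into classads; a minimal sketch (the function and attribute handling below are mine, not actual CRAB client code):

```python
# Sketch: turn a user-provided {key: value} dict from crabConfig into
# JDL lines, instead of hard-coding one named parameter per requirement.
# (Names and formatting are illustrative, not real CRAB client code.)

def gpu_params_to_jdl(params):
    """Render accelerator requirements as custom classad assignments."""
    lines = []
    for key, value in sorted(params.items()):
        if isinstance(value, str):
            lines.append('+{} = "{}"'.format(key, value))  # quote string values
        else:
            lines.append('+{} = {}'.format(key, value))    # numbers go bare
    return '\n'.join(lines)

# Whatever new requirement appears later, no client code change is needed:
jdl = gpu_params_to_jdl({'CUDACapability': '7.0,8.0', 'GPUMemoryMB': 8000})
print(jdl)
# +CUDACapability = "7.0,8.0"
# +GPUMemoryMB = 8000
```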

@novicecpp
Contributor

Let me add a reference about double matchmaking in condor from the SI team (Fall O&C week, 2022):
https://indico.cern.ch/event/1126680/contributions/5084297/attachments/2531873/4356376/20221020_Scheduling_jobs_GPUs_with_SI.pdf

After releasing Site.requireAccelerator, I think we may need to introduce the GPU catalogs to users somehow, to let them know that it is "possible" (but in the future) to request a specific GPU for a CRAB task.
"In the future" means we need a concrete use case from users on why they need this, to convince ourselves and SI to support it.

@belforte
Member Author

shall we close this at this point ?

@novicecpp
Contributor

Not yet. Not until we show some examples to users :)

Moved from the minutes: we got 2 GPU examples from @ckoraka and @AdrianoDee
https://gitlab.cern.ch/tseethon/public-shared/-/tree/master/crab_submit/gpu_tutorial
Wa will write a tutorial for it.

@novicecpp
Contributor

Closing this issue in favor of #7645.
We will follow up on dmwm/WMCore#11595 for better job matching in the same manner as WMAgent does, and open a new issue to implement it in CRAB.
