
add GPU support #6989

Closed
belforte opened this issue Jan 21, 2022 · 15 comments

@belforte
Member

While this may be covered in #6534, it is better to make sure that CRAB can be used with the same flexibility/constraints as production, i.e. aligned with the description in https://github.com/dmwm/WMCore/wiki/GPU-Support

@belforte belforte self-assigned this Jan 21, 2022
@belforte
Member Author

This is also a good item for a new person.

@novicecpp
Contributor

Last week, Katy, Dario, and I kicked off this task. After asking people around (thanks Nikos, Antonio, and Sam), this is what we know:

We need to specify at least 3 classads to match a GPU node:

  • RequiresGPU=1, to match a GPU node.
  • RequestGPUs=n, to match a node that has n GPUs.
  • DESIRED_Sites, a.k.a. Site.whitelist in the CRAB config. If the user does not specify a whitelist, CRAB will not put DESIRED_Sites in the classads.
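Put together, a minimal JDL sketch of those classads could look like this (the site list is illustrative, and whether a custom attribute needs the leading `+` depends on how CRAB builds the submit file, so treat this as a sketch only):

```
# Sketch only, not a complete submit file. In the job classads these appear
# as RequiresGPU, RequestGPUs, and DESIRED_Sites (site list is illustrative).
+RequiresGPU = 1
request_GPUs = 1
+DESIRED_Sites = "T2_CH_CERN,T2_US_Wisconsin"
```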

We have 2 CRAB configs to test submitting a GPU task with CRAB:

  • cudahelloworld from the Patatrack docs, built on lxplus-gpu; we ship the binary and run it with a custom script.
  • pset hltSimpleGPUConfig.py from Sam Harper.
    • Example CRAB task
    • CMSSW with this pset can run on both GPU and non-GPU nodes (confirmed with Sam).
      • On a GPU node, the CMSSW output in the TrigReport ---------- Path Summary ------------ section looks like this:
        == CMSSW: TrigReport ---------- Path   Summary ------------
        == CMSSW: TrigReport  Trig Bit#   Executed     Passed     Failed      Error Name
        == CMSSW: TrigReport     1    0        100          0        100          0 Status_OnGPU
        == CMSSW: TrigReport     1    1        100          0        100          0 DQM_EcalReconstruction_v3
        
      • On a non-GPU node, the output looks like this instead:
        == CMSSW: TrigReport ---------- Path   Summary ------------
        == CMSSW: TrigReport  Trig Bit#   Executed     Passed     Failed      Error Name
        == CMSSW: TrigReport     1    0        100        100          0          0 Status_OnGPU
        == CMSSW: TrigReport     1    1        100        100          0          0 DQM_EcalReconstruction_v3
        

Other reference:

@novicecpp
Contributor

In my opinion, users can use GPUs in 3 ways:

  • CMSSW built-in code (hltSimpleGPUConfig.py for example).
    • A RequiresGPU configuration option should be enough for CRAB users, and CRAB should provide the correct classads for them.
      • This makes the heavy assumption that every GPU code path can also run on CPU instead.
        • Then, what is the actual requirement for CMSSW to run on a GPU?
  • CMSSW custom plugin.
    • I do not know how to build a custom plugin; does it need to align its GPU requirements with main CMSSW? If yes, we can treat it like CMSSW built-in code.
  • Custom scripts: explicitly provide a CUDA binary and a custom script in the CRAB config.
    • This one (and maybe the CMSSW custom plugin too) is for advanced users; they should know what they are doing and supply the GPU requirements correctly. We only need documentation and an example for this.
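For the custom-scripts case, a crabConfig sketch could look like the following (file names and the site list are illustrative placeholders; only standard CRAB parameters are used, with the GPU requirement itself left to whatever accelerator support is eventually adopted):

```python
# Illustrative crabConfig fragment for shipping a pre-built CUDA binary
# together with a wrapper script. File names below are placeholders.
from CRABClient.UserUtilities import config

config = config()
config.JobType.pluginName = 'Analysis'
config.JobType.scriptExe = 'run_cuda.sh'        # wrapper invoked on the worker node
config.JobType.inputFiles = ['cudahelloworld']  # pre-built CUDA binary shipped with the job
config.Site.whitelist = ['T2_CH_CERN']          # restrict to sites known to provide GPUs
```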

@novicecpp
Contributor

novicecpp commented Sep 21, 2022

Next action items:

  • Discuss with CMSSW developers and users what the requirements of CMSSW and of custom plugins/binaries are when running on a GPU.
  • More tests:
    • cudahelloworld with an HC dataset, blacklisting the usual sites [ 'T2_CH_CERN', 'T2_US_Wisconsin' ].
    • Explore how -gencode works with the nvcc command; can we run a binary built with -gencode arch=compute_70,code=sm_70 on a node whose CUDACapability is other than 7.0?
    • Explore the optional(?) classads: CUDACapability, CUDARuntime, GPUMemoryMB.

@novicecpp
Contributor

novicecpp commented Sep 23, 2022

Summary of yesterday's tests:

  1. Whitelisted all GPU sites except T2_CH_CERN and T2_US_Wisconsin.
    1. Task 220922_152424:tseethon_crab_20220922_172420
    • Jobs were distributed to many sites as usual (the failed jobs came from sites with the stub-library error of item 2).
  2. On some nodes the CUDA driver is a stub library:
    1. T2_UK_London_IC (cap 8.0) job_out.6.0.txt
      • The job failed with == CMSSW: cudaErrorStubLibrary: CUDA driver is a stub library, the same log we get when running on a normal lxplus machine.
    2. T1_DE_KIT, Sam's task, node is cap 8.0, job_out.1.0.txt
  3. Tested nvcc -gencode with 7.0, 8.0, and 7.0+8.0.
    I compiled 3 versions of cudahelloworld (7.0, 8.0, and 7.0+8.0) and ran all the binaries against machines with arbitrary CUDACapability.
    1. T3_UK_London_QMUL (cap 8.0) job_out.1.0.txt
      • 7.0 failed with the error == CMSSW: cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device.
      • 8.0 ran fine.
      • 7.0+8.0 ran fine.
    2. T2_US_MIT (cap 7.0) job_out.4.0.txt
      • 7.0 ran fine.
      • 8.0 broke; the error logs are the same as for the jobs at T3_UK_London_QMUL.
      • 7.0+8.0 ran fine.
    3. T2_CH_CERN (cap 7.5) job_out.5.0.txt
      • 7.0 ran fine.
      • 8.0 broke; the error logs are the same as for the jobs at T3_UK_London_QMUL.
      • 7.0+8.0 ran fine.
  4. Explored the optional classads:
    1. CUDACapability="6.0,6.1,7.0" 220922_154933:tseethon_crab_20220922_174929
    2. CUDACapability="8.0" 220922_161815:tseethon_crab_20220922_181811
      • All jobs matched machines that do not have a cap-8.0 GPU.
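The -gencode pattern above is consistent with NVIDIA's binary-compatibility rule: SASS built for sm_XY runs only on devices with the same major capability and an equal or higher minor revision. A small Python sketch of that check (the helper names are mine, not CRAB or condor code):

```python
# Sketch of the cubin binary-compatibility rule suggested by the tests above:
# code compiled for sm_XY runs on a device of the same major capability
# with minor revision >= Y. (Helper names are illustrative, not CRAB code.)

def sass_compatible(code_cap: str, device_cap: str) -> bool:
    """True if a binary built with -gencode code=sm_<code_cap> can run
    on a device with CUDA capability <device_cap>."""
    c_major, c_minor = (int(x) for x in code_cap.split('.'))
    d_major, d_minor = (int(x) for x in device_cap.split('.'))
    return d_major == c_major and d_minor >= c_minor

def binary_runs(gencode_caps, device_cap):
    """A fat binary runs if any embedded SASS is compatible with the device."""
    return any(sass_compatible(c, device_cap) for c in gencode_caps)

# Reproduce the observations above:
assert binary_runs(['7.0'], '7.5')         # T2_CH_CERN (cap 7.5): 7.0 ran fine
assert not binary_runs(['8.0'], '7.5')     # 8.0 broke on cap 7.5
assert not binary_runs(['7.0'], '8.0')     # T3_UK_London_QMUL: 7.0 failed
assert binary_runs(['7.0', '8.0'], '8.0')  # fat binary ran everywhere
```

One caveat: nvcc can also embed PTX (code=compute_XY), which newer devices can JIT-compile; the failures observed here suggest only SASS was embedded in these binaries.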

@novicecpp
Contributor

novicecpp commented Oct 13, 2022

This week we had a discussion with SI and the GPU developers about supporting GPUs in CRAB.

Let me summarize all the info and agreements so far:

  • We provide the same GPUParams as WMAgent supports in https://github.com/dmwm/WMCore/wiki/GPU-Support
    • But we do not provide any default values. Users should know what they want to submit.
      • At least until we have more use cases, or problems that force us to enforce defaults.
    • GPU params suggestions from Andrea Bocci:
      • CUDACapability >= 6 as the only requirement set by default, allowing users to change it (if they are not using CMSSW, or if they rebuilt it for a different capability).
      • CUDARuntime should match the CUDA runtimes (maybe CMS_CUDA_SUPPORTED_RUNTIMES?) published by the sites.
      • Let users specify GPUMemory or GPUName, if they wish to.
      • CUDADriverVersion is useful in order to avoid driver bugs.
  • We will not guarantee that the params users provide will match any resources, due to technical issues:
    • It does not make sense to implement matchmaking in CRAB when it is already done by the condor system.
    • We could test whether the JDL will match any resources, but it is not possible for now: GPU info (e.g. CUDACapability) only becomes available after the pilot starts running.
      • Marco: condor_submit -dryrun takes a JDL and spits out the job classads as if it was submitted to the schedd. Then condor_q --better --jobads job --slotads sl 673.0 simulates the SECOND matchmaking.
      • More detail in the MM link above.
  • We check/enforce the siteWhitelist to GPU sites if requireAccelerator exists.
    • By querying the GPU sites from condor_status, as Marco suggested:
    condor_status -any -pool vocms0207.cern.ch -const 'mytype=="glidefactory" && stringlistmember("CMSGPU", GLIDEIN_Supported_VOs)' -af GLIDEIN_CMSSite | sort | uniq
    
  • Possibly add another periodic_remove rule: if request_GPUs exists, we will fail the job after 2 or 3 days instead of the usual 7 days. We will wait and see if this is needed.
    • How do we differentiate between "job never matched" and "job matched, but now no resources are left"?
      • Let me quote Stefano's chat:
        We do not at the moment. This event was so rare until now that there was no need. Let's wait what things will shape out like in the real world before we worry.
        
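A sketch of what such a periodic_remove rule could look like in the submit file (the 2-day cutoff is the proposal above; the exact expression CRAB would use may differ):

```
# Illustrative only: remove a still-idle GPU job after 2 days instead of 7.
# JobStatus == 1 means Idle; EnteredCurrentStatus is when it entered that state.
periodic_remove = (JobStatus == 1) && (RequestGPUs >= 1) && ((time() - EnteredCurrentStatus) > 2 * 24 * 60 * 60)
```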
  • If a user has a locally built CMSSW (possibly with modified GPU code), scram will build support for all CUDA devices that are supported by the release.
  • About the CUDA Runtime:
    • The CUDA software stack has 3 components:
      • the CUDA Runtime, libcudart.so (e.g. 10.x, 11.x),
      • the CUDA Driver, libcuda.so,
      • the kernel driver (which exposes the driver version number, e.g. 450.80.02).
    • IIUC, the pilot exposes the list of runtime versions that can run on the GPU via CMS_CUDA_SUPPORTED_RUNTIMES.
    • CMSSW ships with the CUDA Runtime and uses the kernel driver from the host; I am not sure about the CUDA Driver library.
    • For ref: https://docs.nvidia.com/deploy/cuda-compatibility/index.html

@belforte
Member Author

thanks Wa. I have edited a bit the part about periodic_remove, let me know if you notice something wrong.

@novicecpp
Contributor

novicecpp commented Oct 19, 2022

I got confirmation from Todor about the GPU params:

  • 3 GPU params (GPUMemoryMB, CUDACapabilities, CUDARuntime) do not have their JDL translation implemented in WMCore yet.
  • 3 mandatory params (GPUName, CUDADriverVersion, CUDARuntimeVersion) are put in the final JDL BUT are not used in matchmaking (yet). The DMWM team is waiting for SI to implement it (but the last time I asked Marco, SI plans to do it; let's see how it goes first).

So, what should we do? Should we wait for SI, or put the GPU params into the Requirements attribute ourselves?

Maybe tomorrow at the O&C week we will have more information from Antonio (SI team) and more use cases from Charis (our GPU user).

@amaltaro

3 GPU params (GPUMemoryMB, CUDACapabilities, CUDARuntime) is not implement JDL translation in WMCore yet.
3 mandatory params (GPUName, CUDADriverVersion, CUDARuntimeVersion) is put in final JDL BUT not use for in

It's actually the reverse: the supported parameters are GPUMemoryMB, CUDACapabilities, and CUDARuntime, while the other 3 have not yet been implemented.

@novicecpp
Contributor

Thanks Alan!

@belforte
Member Author

@novicecpp we should do like WMA does and keep in touch :-)

It sounds like everybody expects changes at some point, but nobody is sure what and when. Hey, new technology (at least for HEP): we are on the bleeding edge.
We do like WMA not only because we are friends with @amaltaro, nor only because we lack the intelligence and creativity to do better, but also (mostly?) to make life easier for SI by presenting them with a common set of requirements.

I still think that for long-term happiness we should let users specify the parameters they want to change via a (key, value) dictionary in crabConfig, rather than plan to add a named parameter for each possible new relevant requirement.
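The dictionary idea could map user-supplied (key, value) pairs straight into classads; a minimal sketch (the function and attribute handling below are mine, not actual CRAB client code):

```python
# Sketch: turn a user-provided {key: value} dict from crabConfig into
# JDL lines, instead of hard-coding one named parameter per requirement.
# (Names and formatting are illustrative, not real CRAB client code.)

def gpu_params_to_jdl(params):
    """Render accelerator requirements as custom classad assignments."""
    lines = []
    for key, value in sorted(params.items()):
        if isinstance(value, str):
            lines.append('+{} = "{}"'.format(key, value))  # quote string values
        else:
            lines.append('+{} = {}'.format(key, value))    # numbers go bare
    return '\n'.join(lines)

# Whatever new requirement appears later, no client code change is needed:
jdl = gpu_params_to_jdl({'CUDACapability': '7.0,8.0', 'GPUMemoryMB': 8000})
print(jdl)
# +CUDACapability = "7.0,8.0"
# +GPUMemoryMB = 8000
```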

@novicecpp
Contributor

Let me add a reference about double matchmaking in condor from the SI team (Fall O&C week, 2022):
https://indico.cern.ch/event/1126680/contributions/5084297/attachments/2531873/4356376/20221020_Scheduling_jobs_GPUs_with_SI.pdf

After releasing Site.requireAccelerator, I think we may need to introduce the GPU catalogs to users somehow, to let them know that it is "possible" (but in the future) to request a specific GPU for a CRAB task.
"In the future" means we need a concrete use case from users on why they need this, to convince ourselves and SI to support it.

@belforte
Member Author

shall we close this at this point ?

@novicecpp
Contributor

Not yet. Not until we show some examples to users :)

Moved from the minutes: we got 2 GPU examples from @ckoraka and @AdrianoDee
https://gitlab.cern.ch/tseethon/public-shared/-/tree/master/crab_submit/gpu_tutorial
Wa will write a tutorial for it.

@novicecpp
Contributor

Closing this issue in favor of #7645.
We will follow up on dmwm/WMCore#11595 for better job matching in the same manner as WMAgent does, and open a new issue to implement it in CRAB.
