[15_0_X] Use HardwareResourcesDescription in ProcessConfiguration #47416

Open
makortel wants to merge 12 commits into CMSSW_15_0_X

Conversation

@makortel
Contributor

makortel commented Feb 20, 2025

PR description:

Backport of #47280, #47355, #47473 (backported by dropping the commit that was reverted there), and #47477 (that PR's commit is split in two: one part replaces the commit dropped in the previous step, and the rest is included in the last commit of this PR).

Resolves cms-sw/framework-team#1248

PR validation:

Code compiles (plus the tests in #47280 and #47355)

If this PR is a backport, please specify the original PR and why you need to backport it. If this PR will be backported, please specify the release cycle the backport is meant for:

Backport of #47280 and #47355

@cmsbuild
Contributor

A new Pull Request was created by @makortel for CMSSW_15_0_X.

It involves the following packages:

  • DQM/SiStripMonitorHardware (dqm)
  • DQMServices/FwkIO (dqm)
  • DataFormats/Provenance (core)
  • FWCore/AbstractServices (core)
  • FWCore/Framework (core)
  • FWCore/Integration (core)
  • FWCore/Services (core)
  • FWCore/Sources (core)
  • FWCore/TestProcessor (core)
  • FWCore/Utilities (core)
  • GeneratorInterface/LHEInterface (generators)
  • HeterogeneousCore/CUDAServices (heterogeneous)
  • HeterogeneousCore/ROCmServices (heterogeneous)
  • IOPool/Common (core)
  • IOPool/Input (core)
  • IOPool/SecondaryInput (core)
  • IOPool/Streamer (core)
  • Mixing/Base (simulation)
  • PhysicsTools/PyTorch (ml)
  • PhysicsTools/TensorFlow (ml)

@Dr15Jones, @antoniovagnerini, @bbilin, @civanch, @cmsbuild, @fwyzard, @kpedro88, @lviliani, @makortel, @mdhildreth, @menglu21, @mkirsano, @rseidita, @smuzaffar, @valsdav, @y19y19 can you please review it and eventually sign? Thanks.
@alberto-sanchez, @arossi83, @barvic, @fabiocos, @felicepantaleo, @fioriNTU, @fwyzard, @idebruyn, @jandrea, @missirol, @mkirsano, @mmusich, @richa2710, @riga, @rovere, @sroychow, @threus, @wddgit this is something you requested to watch as well.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@cmsbuild
Contributor

cmsbuild commented Feb 20, 2025

cms-bot internal usage

@makortel
Contributor Author

enable gpu

@makortel
Contributor Author

@cmsbuild, please test

@cmsbuild
Contributor

+1

Size: This PR adds an extra 220KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-9d1efb/44545/summary.html
COMMIT: c8326da
CMSSW: CMSSW_15_0_X_2025-02-20-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/47416/44545/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 1 line to the logs
  • Reco comparison results: 9 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 4018889
  • DQMHistoTests: Total failures: 68
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4018801
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 49 files compared)
  • Checked 218 log files, 189 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 24 differences found in the comparisons
  • DQMHistoTests: Total files compared: 7
  • DQMHistoTests: Total histograms compared: 53071
  • DQMHistoTests: Total failures: 869
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 52202
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 6 files compared)
  • Checked 24 log files, 30 edm output root files, 7 DQM output files
  • TriggerResults: no differences found

@makortel
Contributor Author

CPU comparison differences are related to #47071

GPU comparison differences look compatible with the non-reproducibilities in the pixel code

@cmsbuild
Contributor

Pull request #47416 was updated. @Dr15Jones, @cmsbuild, @makortel, @smuzaffar, @valsdav, @y19y19 can you please check and sign again.

@makortel
Contributor Author

I took out the commit that altered the edmProvDump output. I opened #47473 to revert the corresponding commit from master so we can check that it is sufficient to get CRAB working again.

@makortel
Contributor Author

@cmsbuild, please test

@cmsbuild
Contributor

-1

Failed Tests: UnitTests GpuUnitTests
Size: This PR adds an extra 16KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-9d1efb/44723/summary.html
COMMIT: 29ae1f5
CMSSW: CMSSW_15_0_X_2025-02-27-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/47416/44723/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found 2 errors in the following unit tests:

---> test TestIOPoolInputReducedProcessHistoryHardwareResources had ERRORS
---> test TestIOPoolStreamerReducedProcessHistoryHardwareResources had ERRORS

GPU Unit Tests

I found 1 errors in the following unit tests:

---> test testTorchSimpleDnnCUDA had ERRORS

Comparison Summary

Summary:

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 8 differences found in the comparisons
  • DQMHistoTests: Total files compared: 7
  • DQMHistoTests: Total histograms compared: 53071
  • DQMHistoTests: Total failures: 380
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 52691
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 6 files compared)
  • Checked 24 log files, 30 edm output root files, 7 DQM output files
  • TriggerResults: no differences found

@makortel makortel force-pushed the processConfigurationHardwareResourcesDescription_150x branch from 29ae1f5 to 4c6ccf4 on February 28, 2025 19:18

@cmsbuild
Contributor

Pull request #47416 was updated. @Dr15Jones, @cmsbuild, @makortel, @smuzaffar, @valsdav, @y19y19 can you please check and sign again.

@makortel
Contributor Author

@cmsbuild, please test

@makortel
Contributor Author

OK, this PR now matches the state of the master branch (though with a slightly different history), and it should work for both CRAB and the unit tests.

@cmsbuild
Contributor

+1

Size: This PR adds an extra 20KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-9d1efb/44754/summary.html
COMMIT: 4c6ccf4
CMSSW: CMSSW_15_0_X_2025-02-28-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/47416/44754/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 24 differences found in the comparisons
  • DQMHistoTests: Total files compared: 7
  • DQMHistoTests: Total histograms compared: 53071
  • DQMHistoTests: Total failures: 877
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 52194
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 6 files compared)
  • Checked 24 log files, 30 edm output root files, 7 DQM output files
  • TriggerResults: no differences found

@valsdav
Contributor

valsdav commented Mar 3, 2025

+ml
