Addition of a new method to StageOutImpl to log details about failing gfal commands #12081

anpicci · 2024-08-23T16:12:19Z

Status

In development

Description

This PR introduces a new method in StageOutImpl.py, further defined in Backends/GFAL2Impl.py, to address @stlammel's request to enhance the information stored in the logs for a failing StageOut command.

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

None

External dependencies / deployment changes

No

cmsdmwmbot · 2024-08-23T16:23:47Z

Jenkins results:

Python3 Unit tests: failed
- 31 new failures
- 47 tests deleted
- 1 changes in unstable tests
Python3 Pylint check: failed
- 3 warnings and errors that must be fixed
- 19 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 8 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15167/artifact/artifacts/PullRequestReport.html

amaltaro

Andrea, please find a few comments along the code.

src/python/WMCore/Storage/Backends/GFAL2Impl.py

src/python/WMCore/Storage/StageOutImpl.py

cmsdmwmbot · 2024-09-11T10:33:59Z

Jenkins results:

Python3 Unit tests: failed
- 3 new failures
Python3 Pylint check: failed
- 4 warnings and errors that must be fixed
- 40 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 11 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15201/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2024-09-11T12:59:26Z

Jenkins results:

Python3 Unit tests: failed
- 1 new failures
- 1 changes in unstable tests
Python3 Pylint check: failed
- 4 warnings and errors that must be fixed
- 93 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 16 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15202/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2024-09-11T13:34:22Z

Jenkins results:

Python3 Unit tests: succeeded
- 2 tests no longer failing
Python3 Pylint check: failed
- 2 warnings and errors that must be fixed
- 91 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 15 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15203/artifact/artifacts/PullRequestReport.html

todor-ivanov

Hi @anpicci, I think the changes look good at first glance. I did not check in the PR description whether this is the final version, but in any case I asked only one question inline, along the same line of thought.

src/python/WMCore/Storage/StageOutImpl.py

amaltaro · 2024-09-19T13:21:19Z

src/python/WMCore/Storage/Backends/GFAL2Impl.py

        copyCommandDict['source'] = self.createFinalPFN(sourcePFN)
        copyCommandDict['destination'] = self.createFinalPFN(targetPFN)

        copyCommand = self.copyCommand % copyCommandDict
        result += copyCommand

+        logging.debug("Actual command which failed: %s", copyCommand)


We don't know if the command failed (retries > 0) or not (retry=0), so it is a misleading message.

amaltaro · 2024-09-19T13:23:55Z

src/python/WMCore/Storage/Backends/GFAL2Impl.py

+                echo "ERROR: gfal-copy exited with $EXIT_STATUS"
+                echo "Source PFN: {source}"
+                echo "Target PFN: {destination}"
+                echo "Cleaning up failed file: {remove_command}"


This command is now missing the file removal logic, as you deleted that in favor of simply printing it.

You might have missed this comment.

src/python/WMCore/Storage/Backends/GFAL2Impl.py

amaltaro · 2024-09-19T13:35:19Z

src/python/WMCore/Storage/Backends/GFAL2Impl.py

+        """
+        result = "#!/bin/bash\n"
+
+        copyCommandDict = {'checksum': '', 'options': '', 'source': '', 'destination': ''}


AFAICT, the actual construction of the stage out command (without any of the exit status check and file removal) is exactly the same between this method and createStageOutCommand. If that is correct, then I think we should separate that logic in a different method and call that method inside createStageOutCommand() or createDebuggingCommand(). This way we guarantee that the code/command will never diverge by accident.

Another option, but it is a larger step, would be to save the actual stage out command (without exit status check/file removal overload) in an instance variable and just access that inside createDebuggingCommand(). This gives us even less room for potential differences in the command construction for real execution and debugging execution (plus, it avoids many internal calls).
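
The first option above (one shared builder called from both methods) could look roughly like the sketch below. It is only an illustration: the class name `GFAL2ImplSketch`, the helper `buildCopyCommand` and the simplified bash wrapper are hypothetical and do not reproduce the real GFAL2Impl code.

```python
import shlex


class GFAL2ImplSketch:
    """Hypothetical stand-in for the GFAL2 backend, not the WMCore class."""

    def buildCopyCommand(self, sourcePFN, targetPFN, options=""):
        """Single place where the bare copy command is assembled."""
        parts = ("gfal-copy", options, sourcePFN, targetPFN)
        return " ".join(part for part in parts if part)

    def createStageOutCommand(self, sourcePFN, targetPFN, options=""):
        """Real execution: the bare command plus a simplified exit-status check."""
        copy = self.buildCopyCommand(sourcePFN, targetPFN, options)
        return f"#!/bin/bash\n{copy}\nEXIT_STATUS=$?\nexit $EXIT_STATUS\n"

    def createDebuggingCommand(self, sourcePFN, targetPFN, options=""):
        """Debugging: the very same bare command, only echoed into the log."""
        copy = self.buildCopyCommand(sourcePFN, targetPFN, options)
        return f"#!/bin/bash\necho {shlex.quote('Command: ' + copy)}\n"
```

Because both public methods obtain the command from `buildCopyCommand`, the executed command and the logged command cannot drift apart by accident.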

src/python/WMCore/Storage/Backends/GFAL2Impl.py

cmsdmwmbot · 2024-09-23T11:52:12Z

Jenkins results:

Python3 Unit tests: failed
- 3 new failures
- 2 tests no longer failing
- 1 changes in unstable tests
Python3 Pylint check: failed
- 2 warnings and errors that must be fixed
- 90 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 14 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15224/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2024-09-25T08:40:04Z

Jenkins results:

Python3 Unit tests: failed
- 2 new failures
- 2 changes in unstable tests
Python3 Pylint check: failed
- 10 warnings and errors that must be fixed
- 92 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 21 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15231/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2024-09-25T10:57:55Z

Jenkins results:

Python3 Unit tests: failed
- 3 new failures
- 2 changes in unstable tests
Python3 Pylint check: failed
- 2 warnings and errors that must be fixed
- 90 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 18 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15232/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2024-09-25T11:19:01Z

Jenkins results:

Python3 Unit tests: failed
- 1 new failures
- 2 changes in unstable tests
Python3 Pylint check: failed
- 4 warnings and errors that must be fixed
- 91 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 19 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15233/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2024-09-25T11:46:42Z

Jenkins results:

Python3 Unit tests: failed
- 1 new failures
- 1 changes in unstable tests
Python3 Pylint check: failed
- 4 warnings and errors that must be fixed
- 91 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 19 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15234/artifact/artifacts/PullRequestReport.html

todor-ivanov · 2024-09-25T18:33:54Z

test this please

cmsdmwmbot · 2024-09-25T18:43:10Z

Jenkins results:

Python3 Unit tests: succeeded
- 4 tests no longer failing
- 2 changes in unstable tests
Python3 Pylint check: failed
- 4 warnings and errors that must be fixed
- 91 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 19 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15239/artifact/artifacts/PullRequestReport.html

amaltaro

Andrea, please find some comments inline.
In addition, please review the ticket requirements. AFAICT, two pieces of requested information are still missing:

- location of the gfal-cp binary (to be retrieved with which gfal-cp)
- version of gfal-cp (please check the documentation to see how to get it, perhaps a --version)
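
As a rough sketch, both items could be collected from Python as shown below. The helper name `describeBinary` is hypothetical, and the `--version` flag is an assumption that should be checked against the tool's documentation. Note also that the diff hunks in this PR call the binary gfal-copy.

```python
import shutil
import subprocess


def describeBinary(name, versionFlag="--version", timeout=10):
    """Return (path, versionText) for an executable, or (None, None) if it is not on PATH."""
    path = shutil.which(name)  # Python equivalent of the shell's `which`
    if path is None:
        return None, None
    try:
        proc = subprocess.run([path, versionFlag], capture_output=True,
                              text=True, timeout=timeout, check=False)
    except (OSError, subprocess.TimeoutExpired):
        # The binary exists but could not be queried for its version.
        return path, None
    # Some tools print their version on stderr, so keep both streams.
    return path, (proc.stdout + proc.stderr).strip()
```

A call such as `describeBinary("gfal-copy")` would then give the two values to log next to the failed command.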

amaltaro · 2024-09-27T01:58:58Z

src/python/WMCore/Storage/StageOutImpl.py

+            logging.error("Maximum number of retries exhausted. Further details on the failed command reported below.")
+            command = self.createDebuggingCommand(sourcePFN, targetPFN, options, checksums)
+            self.executeCommand(command)
+            raise stageOutEx


I guess this will have to be tested, but I wonder if we should do raise stageOutEx from None to avoid chaining exceptions? Just reading this code, I am not really sure what will happen.
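
To make the question concrete, here is a small self-contained sketch of how Python records the exception context with and without `from None`. The `StageOutError` class and the helper names are hypothetical stand-ins, not the WMCore code.

```python
class StageOutError(Exception):
    """Hypothetical stand-in for the stage-out exception."""


def reraiseSaved(suppress):
    """Raise a previously saved exception while another one is being handled."""
    saved = StageOutError("copy failed")
    try:
        raise RuntimeError("debugging command failed")
    except RuntimeError:
        if suppress:
            # `from None` hides the RuntimeError from the printed traceback.
            raise saved from None
        # Without it, the RuntimeError is implicitly chained as the context.
        raise saved


def contextOf(suppress):
    """Return (__context__, __suppress_context__) of the exception that escapes."""
    try:
        reraiseSaved(suppress)
    except StageOutError as ex:
        return ex.__context__, ex.__suppress_context__
```

In both cases `__context__` still holds the `RuntimeError`. With `from None`, `__suppress_context__` becomes true, so the traceback no longer prints the "During handling of the above exception, another exception occurred" section. If `raise stageOutEx` runs outside any `except` block, no new context is attached in the first place.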

src/python/WMCore/Storage/Backends/GFAL2Impl.py

amaltaro · 2024-09-27T02:04:10Z

src/python/WMCore/Storage/Backends/GFAL2Impl.py

+                echo "ERROR: gfal-copy exited with $EXIT_STATUS"
+                echo "Source PFN: {source}"
+                echo "Target PFN: {destination}"
+                echo "Cleaning up failed file: {remove_command}"


You might have missed this comment.

amaltaro · 2024-09-27T02:06:29Z

src/python/WMCore/Storage/Backends/GFAL2Impl.py

+
+        proxyInfo = None
+        try:
+            stdout, stderr, returnCode = runCommand("voms-proxy-info", timeout=10)


I am not sure it will work either, as I am not sure it can find the proxy file. We might have to append this command with -file $X509_USER_PROXY, but these things need to be tested.
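
A minimal sketch of the suggestion, assuming the proxy location is exported in X509_USER_PROXY. The helper name `buildProxyInfoCommand` is hypothetical, and how the result is handed to runCommand (list or string) is not shown in this thread, so that part is left open.

```python
import os
import shlex


def buildProxyInfoCommand(environ=None):
    """Build the voms-proxy-info argument list, adding -file when X509_USER_PROXY is set."""
    environ = os.environ if environ is None else environ
    cmd = ["voms-proxy-info"]
    proxyFile = environ.get("X509_USER_PROXY")
    if proxyFile:
        # Point the tool at the proxy explicitly instead of relying on its default lookup.
        cmd += ["-file", proxyFile]
    return cmd


def asShellString(cmd):
    """Render the argument list as a safely quoted shell string."""
    return shlex.join(cmd)
```

Passing the environment as an argument keeps the helper easy to test without touching the real process environment.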

amaltaro · 2024-09-27T02:10:44Z

src/python/WMCore/Storage/Backends/GFAL2Impl.py

+            destination=copyCommandDict['destination'],
+            setup_info=self.setups
+        )
+        if proxyInfo is not None:


I feel like just printing the proxyInfo content in the try/except above will give log viewers a much better experience than passing the already retrieved content to a sub-process to be executed together with everything else. This is especially true if that sub-process can fail, because then it will print the whole command that failed.

cmsdmwmbot · 2024-09-27T08:06:00Z

Jenkins results:

Python3 Unit tests: failed
- 2 new failures
- 1 changes in unstable tests
Python3 Pylint check: failed
- 4 warnings and errors that must be fixed
- 92 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 20 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15242/artifact/artifacts/PullRequestReport.html

anpicci · 2024-09-27T08:25:12Z

test this please

cmsdmwmbot · 2024-09-27T08:28:09Z

Jenkins results:

Python3 Unit tests: failed
- 3 new failures
Python3 Pylint check: failed
- 2 warnings and errors that must be fixed
- 91 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 17 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15243/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2024-09-27T08:33:34Z

Jenkins results:

Python3 Unit tests: failed
- 2 new failures
- 1 changes in unstable tests
Python3 Pylint check: failed
- 2 warnings and errors that must be fixed
- 91 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 17 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15244/artifact/artifacts/PullRequestReport.html

anpicci · 2024-09-27T08:35:13Z

test this please

cmsdmwmbot · 2024-09-27T08:43:48Z

Jenkins results:

Python3 Unit tests: failed
- 2 new failures
Python3 Pylint check: succeeded
- 89 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 15 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15245/artifact/artifacts/PullRequestReport.html

anpicci · 2024-10-16T13:42:38Z

test this please

cmsdmwmbot · 2024-10-16T13:50:50Z

Jenkins results:

Python3 Unit tests: failed
- 2 new failures
- 2 changes in unstable tests
Python3 Pylint check: succeeded
- 88 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 12 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15319/artifact/artifacts/PullRequestReport.html

anpicci · 2024-10-16T15:46:24Z

Hi @stlammel, I have added some newlines, as can be seen here.

Regarding the stdout/stderr, I agree with your sentiment. However, changing it would require modifying a dependency that also affects other WMCore scripts. As a result, it is clearly out of the scope of this PR, and I agreed with @amaltaro to open a new issue to fix the confusion between stdout and stderr appearing in the logs.

cmsdmwmbot · 2024-10-16T15:55:13Z

Jenkins results:

Python3 Unit tests: succeeded
- 3 changes in unstable tests
Python3 Pylint check: succeeded
- 88 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 12 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15325/artifact/artifacts/PullRequestReport.html

stlammel · 2024-10-16T16:20:52Z

Ok, thanks Andrea! - Stephan

anpicci · 2024-10-16T16:21:00Z

@amaltaro everything should be fine with my additions in this PR, and ready to be reviewed

amaltaro

@anpicci these changes are looking good to me.
However, I would suggest revisiting the pycodestyle report in Jenkins and resolving those warnings (it looks like all of them have been added by this PR).
Once you provide those, feel free to already squash commits accordingly.

amaltaro · 2024-10-17T01:43:33Z

src/python/WMCore/Storage/StageOutImpl.py

        for retryCount in range(self.numRetries + 1):
            try:
                logging.info("Running the stage out...")
                self.executeCommand(command)
-                break
+                break  # This line won't be reached due to the raised error


Unless there is no exception, in which case it does get executed.
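
A stripped-down sketch of the same loop shape shows why the `break` is the success path. `runWithRetries` and the use of `RuntimeError` are hypothetical simplifications of the code under review.

```python
def runWithRetries(action, numRetries):
    """Call action() up to numRetries + 1 times (numRetries >= 0).

    Returns the index of the attempt that succeeded, or re-raises the last
    error once every attempt has failed.
    """
    lastError = None
    for retryCount in range(numRetries + 1):
        try:
            action()
            break  # reached only when action() did not raise
        except RuntimeError as ex:
            lastError = ex
    else:
        # The loop ended without hitting break: every attempt raised.
        raise lastError
    return retryCount
```

When `action()` raises, control jumps straight to the `except` clause and the `break` is skipped for that iteration. When it returns normally, the `break` ends the retry loop.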

src/python/WMCore/Storage/StageOutImpl.py

src/python/WMCore/Storage/Backends/GFAL2Impl.py

test/python/WMCore_t/Storage_t/Backends_t/GFAL2Impl_t.py

anpicci · 2024-10-17T07:55:55Z

test this please

cmsdmwmbot · 2024-10-17T08:00:07Z

Jenkins results:

Python3 Unit tests: failed
- 1 new failures
- 2 changes in unstable tests
Python3 Pylint check: succeeded
- 88 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 11 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15330/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2024-10-17T08:04:55Z

Jenkins results:

Python3 Unit tests: succeeded
- 1 changes in unstable tests
Python3 Pylint check: succeeded
- 88 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 11 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15331/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2024-10-17T08:56:09Z

Jenkins results:

Python3 Unit tests: failed
- 6 new failures
- 1 changes in unstable tests
Python3 Pylint check: failed
- 2 warnings and errors that must be fixed
- 79 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 4 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15332/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2024-10-17T09:55:08Z

Jenkins results:

Python3 Unit tests: succeeded
- 2 changes in unstable tests
Python3 Pylint check: succeeded
- 81 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 7 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15333/artifact/artifacts/PullRequestReport.html

anpicci · 2024-10-17T10:46:13Z

@amaltaro it should be ready now

cmsdmwmbot · 2024-10-17T10:55:37Z

Jenkins results:

Python3 Unit tests: failed
- 1 new failures
- 2 changes in unstable tests
Python3 Pylint check: succeeded
- 81 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 5 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15334/artifact/artifacts/PullRequestReport.html

amaltaro

Thanks Andrea. It is looking better now and apparently you even made changes that were not required. However, I still see some easy fixes reported in the Jenkins report section "Warnings from pycodestyle (pep8) by file" and we should make use of those.

Maybe it is something that we can improve in the contribution guidelines, giving people the relevant tools/directions on how they can check those out in their local nodes instead of relying on Jenkins.

We need this for the new agent release, so let us move on with this and keep these in mind for future contributions. Thanks!

anpicci · 2024-10-17T12:55:53Z

@amaltaro I agree. In particular, the pylint comment about f-strings is worth addressing on a regular basis. It seems that using .format() is not sufficient to resolve that comment, which means we should adopt the f"string{python_var}" approach.
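
For reference, the three spellings side by side. This is a generic illustration with a hypothetical variable, not code from this PR. Pylint's consider-using-f-string check (C0209) reports both the `%` and the `.format()` forms.

```python
pythonVar = "gfal-copy"

# Both of these trigger pylint's consider-using-f-string (C0209).
oldPercent = "command: %s" % pythonVar
oldFormat = "command: {}".format(pythonVar)

# The form that the check asks for.
newStyle = f"command: {pythonVar}"

# All three produce the same string; only the spelling differs.
assert oldPercent == oldFormat == newStyle
```

Logging calls are a separate case: pylint prefers `logging.debug("... %s", value)` with lazy `%s` arguments there and warns about f-strings inside them (logging-fstring-interpolation).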

stlammel · 2024-10-19T16:04:38Z

We/WMCore is probably picking up gfal2 from the container, right? Do we log the full container name that is being used or the gfal2 version/RPM we use? (It just came to me overnight that /usr/bin/gfal-copy may differ for the same OS container request, as we select by generic names, which are links that change.)

Stephan

anpicci added the Technical Debt label Aug 23, 2024

anpicci requested a review from amaltaro August 23, 2024 16:12

anpicci self-assigned this Aug 23, 2024

amaltaro reviewed Sep 10, 2024

anpicci requested review from amaltaro, todor-ivanov and khurtado September 11, 2024 13:36

todor-ivanov approved these changes Sep 17, 2024

amaltaro requested changes Sep 19, 2024

anpicci requested a review from amaltaro September 24, 2024 08:15

amaltaro requested changes Sep 27, 2024

amaltaro added the PR: Do not merge yet and PR: squashing needed labels Sep 27, 2024

anpicci force-pushed the devb_11731 branch 2 times, most recently from 2c576e8 to 3f5b620 Compare October 16, 2024 15:41

anpicci requested a review from amaltaro October 16, 2024 16:20

amaltaro approved these changes Oct 17, 2024

anpicci added 2 commits October 17, 2024 12:39

Augmentation of gfal-cp debuggin for StageOutImpl and GFAL2Impl scripts

aae9f0e

Adjusting the unit test scripts

b8e502b

anpicci force-pushed the devb_11731 branch from 2947409 to b8e502b Compare October 17, 2024 10:39

amaltaro approved these changes Oct 17, 2024

amaltaro removed the PR: Do not merge yet, PR: squashing needed, and Technical Debt labels Oct 17, 2024

amaltaro merged commit 64c182f into dmwm:master Oct 17, 2024
2 of 4 checks passed

anpicci mentioned this pull request Oct 17, 2024

Improvement of the stdout/stderr handling in the WMCore logs #12149

Open

This was referenced Dec 11, 2024

Add extra debugging information for failed stageout #9428

Closed

Enhance stage out error report #9393

Closed
