-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Addition of a new method to StageOutImpl to log details about failing gfal commands #12081
Conversation
Jenkins results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Andrea, please find a few comments along the code.
Jenkins results:
|
Jenkins results:
|
Jenkins results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @anpicci I think the changes look good at a first glance. I did not check in the PR description if this is the final version though, but anyway, I asked only one question inline in the same line of thought.
copyCommandDict['source'] = self.createFinalPFN(sourcePFN) | ||
copyCommandDict['destination'] = self.createFinalPFN(targetPFN) | ||
|
||
copyCommand = self.copyCommand % copyCommandDict | ||
result += copyCommand | ||
|
||
logging.debug("Actual command which failed: %s", copyCommand) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't know if the command failed (retries > 0) or not (retry=0), so it is a misleading message.
echo "ERROR: gfal-copy exited with $EXIT_STATUS" | ||
echo "Source PFN: {source}" | ||
echo "Target PFN: {destination}" | ||
echo "Cleaning up failed file: {remove_command}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This command is now missing the file removal logic, as you deleted that in favor of simply printing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might have missed this comment.
""" | ||
result = "#!/bin/bash\n" | ||
|
||
copyCommandDict = {'checksum': '', 'options': '', 'source': '', 'destination': ''} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICT, the actual construction of the stage out command (without any of the exit status check and file removal) is exactly the same between this method and createStageOutCommand
. If that is correct, then I think we should separate that logic in a different method and call that method inside createStageOutCommand()
or createDebuggingCommand()
. This way we guarantee that the code/command will never diverge by accident.
Another option, but it is a larger step, would be to save the actual stage out command (without exit status check/file removal overload) in a instance variable and just access that inside createDebuggingCommand()
. This gives us even less room for potential differences in the command construction for real execution and debugging execution (plus, it avoids many internal calls).
Jenkins results:
|
Jenkins results:
|
Jenkins results:
|
Jenkins results:
|
Jenkins results:
|
test this please |
Jenkins results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Andrea, please find some comments inline.
In addition, please review the ticket requirements. AFAICT, it is missing still 2 desired information:
- location of the gfal-cp binary (to be retrieved with
which gfal-cp
) - version of gfal-cp (please check the documentation to see how to get it, perhaps a --version)
logging.error("Maximum number of retries exhausted. Further details on the failed command reported below.") | ||
command = self.createDebuggingCommand(sourcePFN, targetPFN, options, checksums) | ||
self.executeCommand(command) | ||
raise stageOutEx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this will have to be tested, but I wonder if we should do raise stageOutEx from None
to avoid chaining exceptions? Just reading this code, I am not really sure what will happen.
echo "ERROR: gfal-copy exited with $EXIT_STATUS" | ||
echo "Source PFN: {source}" | ||
echo "Target PFN: {destination}" | ||
echo "Cleaning up failed file: {remove_command}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might have missed this comment.
|
||
proxyInfo = None | ||
try: | ||
stdout, stderr, returnCode = runCommand("voms-proxy-info", timeout=10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure it will work either, as I am not sure it can find the proxy file. We might have to append this command with -file $X509_USER_PROXY
, but these things need to be tested.
destination=copyCommandDict['destination'], | ||
setup_info=self.setups | ||
) | ||
if proxyInfo is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like just printing the proxyInfo content in the try/except above will give log viewers a much better experience than passing it over (the actual content already retrieved) to a sub-process to execute all together. Especially if that can fail, because then it will print the whole command that failed
Jenkins results:
|
test this please |
Jenkins results:
|
Jenkins results:
|
test this please |
Jenkins results:
|
test this please |
Jenkins results:
|
2c576e8
to
3f5b620
Compare
Hi @stlammel , I have added some newlines , as it can be seen here. Regarding the stdout/stderr, I agree with your sentiment. However, changing it would require make modifications to a dependency that affects also other WMCore scripts. As a result, it is easily out of the scope of this PR, and I agreed with @amaltaro to open a new issue to fix the confusion between stdout and stderr appearing in the logs |
Jenkins results:
|
Ok, thanks Andrea! - Stephan |
@amaltaro everything should be fine with my additions in this PR, and ready to be reviewed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@anpicci these changes are looking good to me.
However, I would suggest revisiting the pycodestyle report in jenkins and resolve those (it looks like all of them have been added by this PR).
Once you provide those, feel free to already squash commits accordingly.
for retryCount in range(self.numRetries + 1): | ||
try: | ||
logging.info("Running the stage out...") | ||
self.executeCommand(command) | ||
break | ||
break # This line won't be reached due to the raised error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless if there is no exception, then it gets executed.
test this please |
Jenkins results:
|
Jenkins results:
|
Jenkins results:
|
Jenkins results:
|
@amaltaro it should be ready now |
Jenkins results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Andrea. It is looking better now and apparently you even make changes that were not required. However, I still see some easy fixes reported in the jenkins report section "Warnings from pycodestyle (pep8) by file" and we should make use of those.
Maybe it is something that we can improve in the contribution guidelines, giving people the relevant tools/directions on how they can check those out in their local nodes instead of relying on Jenkins.
We need this for the new agent release, so let us move on with this and keep these in mind for future contributions. Thanks!
@amaltaro I agree, there is in particular the pylint comment about f-strings that is worth to address on regular basis. It seems like using |
We/WMCore is probably picking up gfal2 from the container, right? Do we log the full container name that is being used or the gfal2 version/RPM we use? (It just came to me over night that using /usr/bin/gfal-copy may be different for the same OS container request as we select by generic names which are links that change.
|
Fixes #11731
Status
In development
Description
This PR introduces a new method for StageOutImpl.py, further defined in Backends/GFAL2Impl.py, in order to address @stlammel's request to enforce the information stored in the logs for a failing StageOut command.
Is it backward compatible (if not, which system it affects?)
YES
Related PRs
None
External dependencies / deployment changes
No