Fix: Allow flytepropeller to list google cloud storage objects for distributed task error handling #40
With the 1.14 release, flytepropeller can identify the root-cause error in distributed tasks such as PyTorch jobs by aggregating error information from all workers and identifying the error that occurred first. See RFC flyteorg/flyte#5598.
To do this, flytepropeller needs to list the error files written by the different workers in the so-called raw output prefix bucket of the respective execution.
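For illustration, this is roughly the kind of listing operation involved (a minimal sketch using the `google-cloud-storage` Python client; the bucket name, prefix, and error file naming below are placeholders, not the exact paths flytepropeller uses):

```python
# Minimal sketch: list per-worker error files under an execution's
# raw output prefix. Bucket, prefix, and the "error" naming are
# illustrative placeholders.
from google.cloud import storage

client = storage.Client()
# This call requires the storage.objects.list permission on the bucket.
blobs = client.list_blobs("my-raw-output-bucket", prefix="executions/abc123/")
error_files = [b.name for b in blobs if "error" in b.name]
print(error_files)
```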
In GCP, the permission to list objects in buckets (`storage.objects.list`) needs to be added to the custom IAM role for flytepropeller.
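If the role is managed outside this module, the equivalent change could also be made through the IAM v1 API; purely as a hedged sketch (the project and role name below are hypothetical):

```python
# Hedged sketch: append storage.objects.list to an existing custom role
# via the IAM v1 REST API. The role name is a placeholder.
from googleapiclient import discovery

iam = discovery.build("iam", "v1")
name = "projects/my-project/roles/flytepropeller"  # hypothetical role id

role = iam.projects().roles().get(name=name).execute()
perms = role.setdefault("includedPermissions", [])
if "storage.objects.list" not in perms:
    perms.append("storage.objects.list")
    iam.projects().roles().patch(name=name, body=role).execute()
```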
Unfortunately, I don't have access to an AWS or Azure environment to test whether changes are required there as well, but from what I can see in the documentation, no changes should be required:
AWS: the documented setup should already allow listing objects in the bucket.
Azure: the Azure module assigns the broad `Storage Blob Data Owner` role, which already includes permission to list blobs.