Fix: Allow flytepropeller to list google cloud storage objects for distributed task error handling #40

Merged
merged 1 commit into from
Dec 11, 2024

Collaborator

@fg91 fg91 commented Dec 10, 2024

With the 1.14 release, flytepropeller can identify the root-cause error in distributed tasks such as PyTorch jobs by aggregating error information from all workers and selecting the error that occurred first. See RFC flyteorg/flyte#5598.
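As a rough illustration of "selecting the error that occurred first", here is a minimal Python sketch. The `WorkerError` record, its fields, and the sample messages are hypothetical and only stand in for whatever flytepropeller actually reads from the workers' error files; the real implementation is in Go and is described in the RFC.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class WorkerError:
    """Hypothetical record of one worker's failure (not Flyte's actual schema)."""
    worker: str
    timestamp: float  # assumed: seconds since the epoch when the error was raised
    message: str


def first_error(errors: Iterable[WorkerError]) -> Optional[WorkerError]:
    """Return the error with the smallest timestamp, or None if there are none."""
    return min(errors, key=lambda e: e.timestamp, default=None)


# Made-up example: worker-0 fails first, the other workers fail as a consequence.
errors = [
    WorkerError("worker-1", 1733860802.0, "timeout waiting for worker-0"),
    WorkerError("worker-0", 1733860800.5, "out of memory"),
    WorkerError("worker-2", 1733860803.1, "timeout waiting for worker-0"),
]
root_cause = first_error(errors)
```

In this made-up scenario the timeouts on worker-1 and worker-2 are follow-up failures, so reporting the earliest error (worker-0) points at the likely root cause.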

To do this, flytepropeller needs to list the error files written by the different workers in the so-called raw output prefix bucket of the respective execution.

In GCP, this requires adding a permission for listing objects in buckets to the custom IAM role for flytepropeller.
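The following Python sketch illustrates why read access alone is not enough here. It uses an in-memory stand-in for a bucket, not a real Cloud Storage client. `storage.objects.get` and `storage.objects.list` are existing Cloud Storage IAM permissions, but the assumption that `storage.objects.list` is the one this change adds, as well as the object names and the `.pb` suffix filter, are illustrative guesses on my part.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


class PermissionDenied(Exception):
    """Raised when the role lacks the permission an operation requires."""


@dataclass
class FakeBucket:
    """In-memory stand-in for a bucket; not a real cloud storage client."""
    objects: Dict[str, bytes] = field(default_factory=dict)

    def get_object(self, role: Set[str], name: str) -> bytes:
        if "storage.objects.get" not in role:
            raise PermissionDenied("storage.objects.get")
        return self.objects[name]

    def list_objects(self, role: Set[str], prefix: str) -> List[str]:
        if "storage.objects.list" not in role:
            raise PermissionDenied("storage.objects.list")
        return sorted(n for n in self.objects if n.startswith(prefix))


def collect_worker_errors(bucket: FakeBucket, role: Set[str], prefix: str) -> Dict[str, bytes]:
    """Discover error files under the prefix, then read each one."""
    # The worker-specific file names are not known up front, so they must be listed.
    names = bucket.list_objects(role, prefix)
    return {n: bucket.get_object(role, n) for n in names if n.endswith(".pb")}


# Hypothetical layout of a raw output prefix with one error file per worker.
bucket = FakeBucket({
    "raw/exec-1/error-worker-0.pb": b"out of memory",
    "raw/exec-1/error-worker-1.pb": b"timeout",
    "raw/exec-1/outputs.bin": b"...",
})
role_before = {"storage.objects.get"}
role_after = {"storage.objects.get", "storage.objects.list"}
```

With `role_before`, `collect_worker_errors` fails at the listing step even though every individual file would be readable; with `role_after`, both error files are found.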


Unfortunately, I don’t have access to an AWS or Azure environment to test whether changes are required there as well, but from what I can see in the documentation, none should be needed:

AWS:

[…] the s3:ListBucket permission (assigned by the IaC in this repo) allows the user to use the Amazon S3 ListObjectsV2 operation. (source)

[The ListObjectsV2 operation] returns some or all (up to 1,000) of the objects in a bucket with each request. (source)
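Because each request returns at most 1,000 objects, a caller that wants every key under a prefix has to page through the results. The sketch below shows the general token-based loop in Python against a fake page fetcher; `iter_all_keys`, `make_fake_fetcher`, and the key names are hypothetical. With boto3, the equivalent would be repeated `list_objects_v2` calls passing `ContinuationToken` from the previous response's `NextContinuationToken`, which is not exercised here.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# (keys in this page, token for the next page or None if this was the last page)
Page = Tuple[List[str], Optional[str]]


def iter_all_keys(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[str]:
    """Yield every key by following continuation tokens until none is returned."""
    token: Optional[str] = None
    while True:
        keys, token = fetch_page(token)
        yield from keys
        if token is None:
            return


def make_fake_fetcher(all_keys: List[str], page_size: int = 1000) -> Callable[[Optional[str]], Page]:
    """Build a fetcher that serves all_keys in pages; the token is the next start index."""
    def fetch_page(token: Optional[str]) -> Page:
        start = int(token) if token is not None else 0
        end = start + page_size
        next_token = str(end) if end < len(all_keys) else None
        return all_keys[start:end], next_token
    return fetch_page


# 2,500 made-up keys are served in three pages of at most 1,000 each.
keys = [f"raw/exec-1/error-{i}.pb" for i in range(2500)]
listed = list(iter_all_keys(make_fake_fetcher(keys)))
```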

Azure:

The Azure module assigns the broad Storage Blob Data Owner role:

Provides full access to Azure Storage blob containers and data (source)

…ects for distributed task error handling

Signed-off-by: Fabio Grätz <[email protected]>
@fg91 fg91 requested a review from davidmirror-ops December 10, 2024 20:21
@fg91 fg91 self-assigned this Dec 10, 2024
@fg91 fg91 added the bug Something isn't working label Dec 10, 2024