
Dataflow jobs failing when temp_storage_location is on bucket with non-persistent retention #74

Open
jbusecke opened this issue May 11, 2023 · 2 comments


@jbusecke
Contributor

@cisaacstern and I just debugged a job submission over here. It turns out that the job fails if it writes to a non-persistent bucket.

The bucket in question was gs://leap-scratch (set up by 2i2c for the leap-stc org) and the error was:

Workflow failed. Causes: Unable to create directory: gs://leap-scratch/data-library/temp/gh-leap-stc-data-management-da1b838-1683838917.1683838945.611305/dax-tmp-2023-05-11_14_02_30-7266315841491564283-S05-0-dcd150a5967231a.

Changing both temp_storage_location and the cache location to the persistent bucket fixed this issue.
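For anyone hitting the same error, here is a minimal sketch of the fix as a pangeo-forge-runner-style traitlets config. The option names follow this issue's wording and the `leap-persistent` paths are illustrative assumptions, not verified against the actual deployment, so check your own config for the real keys:

```python
# Hedged sketch: point both the temp storage and the input cache at a
# bucket without an aggressive retention/lifecycle policy.
# Option names and bucket paths are illustrative, not confirmed.
c.Bake.temp_storage_location = "gs://leap-persistent/data-library/temp"
c.InputCacheStorage.root_path = "gs://leap-persistent/data-library/cache"
```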

Weird things 🤪

  • The service account used to deploy the job from GitHub Actions is the same one used on the workers, yet the deployment writes to the bucket successfully while the runtime write fails.
  • @cisaacstern said that in CI testing of pangeo-forge-runner they use a non-persistent bucket for temp_storage_location without issues -> Is something specific in the retention policies messing things up?
@cisaacstern
Member

Hypotheses:

Is something specific in the retention policies messing things up?

Yes, it could be some subtle difference in the retention policies of our CI bucket for pangeo-forge-runner vs. the leap-scratch bucket. We should take a close look at any differences between these two buckets.
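A quick way to compare would be to dump both buckets' retention and lifecycle settings and diff them. A sketch with the google-cloud-storage Python client, where the CI bucket name is a placeholder:

```python
from google.cloud import storage

client = storage.Client()
# "pangeo-forge-runner-ci" is a placeholder; substitute the real CI bucket.
for name in ("pangeo-forge-runner-ci", "leap-scratch"):
    bucket = client.get_bucket(name)
    print(f"{name}:")
    print("  retention_period:", bucket.retention_period)  # seconds, or None if unset
    print("  lifecycle_rules:", list(bucket.lifecycle_rules))
```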

Unable to create directory: gs://leap-scratch/data-library/temp/gh-leap-stc-data-management-da1b838-1683838917.1683838945.611305/dax-tmp-2023-05-11_14_02_30-7266315841491564283-S05-0-dcd150a5967231a.

The fact that the error occurs while creating a directory feels possibly significant. When this job ultimately succeeded (using the persistent bucket for temp_storage_location), we found a set of .recordio files in that directory. I have not previously seen these files (or their associated directory) in tmp buckets for other Dataflow jobs, but admittedly I have also not looked too closely at what was in the tmp bucket.

I wonder if there is something specific about this Dataflow job that prompted Dataflow to create this directory of .recordio files. Perhaps the real issue is that creating an empty directory (which would subsequently be populated by these files) in a non-persistent bucket is what raises the error, and we simply haven't hit it before because our CI pipeline doesn't trigger creation of these files?
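If that hypothesis is right, it should be reproducible without Dataflow. GCS has no true directories; a "directory" is typically emulated as a zero-byte object whose name ends in a slash. A sketch like the following (the object path is made up for illustration) could test whether that operation fails on the scratch bucket:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("leap-scratch")
# Emulate "creating a directory": write a zero-byte placeholder object
# whose name ends in "/". If the bucket's retention/lifecycle setup is
# the culprit, this write should fail the same way the Dataflow job did.
placeholder = bucket.blob("data-library/temp/dax-tmp-repro-test/")
placeholder.upload_from_string(b"")
print("created", placeholder.name)
```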

@cisaacstern
Member

Plot twist! Dataflow appears to delete this directory by the time the job is complete, whereas other objects in the tmp directory persist after job completion.
