Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

s3s-fs: fix incomplete uploads by writing via a temp file #116

Merged
merged 1 commit into from
Dec 6, 2023

Conversation

amunra
Copy link
Contributor

@amunra amunra commented Dec 5, 2023

I've noticed that occasionally put object requests would result in empty or partially written objects.
After a bit of debugging this turned out to be cases where the S3 client or s3s-fs was interrupted/killed mid-upload.

The current implementation writes directly to the final destination, which can cause partial writes.

This PR writes to a temporary file path first, then moves to the final location.
This should also take care of cases when the s3s-fs was killed abruptly mid-write.

To handle this, there's also a start-up clean-up of any leftover outstanding partially written files.


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@amunra amunra force-pushed the upload_via_tmp_file branch 2 times, most recently from 80135cc to f3cc793 Compare December 5, 2023 20:18
@amunra amunra marked this pull request as draft December 5, 2023 20:18
@amunra amunra force-pushed the upload_via_tmp_file branch from f3cc793 to 298c268 Compare December 5, 2023 21:28
@amunra amunra marked this pull request as ready for review December 5, 2023 21:40
@Nugine Nugine mentioned this pull request Dec 6, 2023
6 tasks
@Nugine Nugine merged commit 7b69f21 into Nugine:main Dec 6, 2023
7 checks passed
@Nugine
Copy link
Owner

Nugine commented Dec 6, 2023

Using temp files is a good practice. Is there any existing library for this?

@amunra
Copy link
Contributor Author

amunra commented Dec 6, 2023

Using temp files is a good practice. Is there any existing library for this?

I had concerns about additional dependencies. If this is not an issue it would be indeed better IMO. IIRC, temp files get cleared automatically on process exit, so the startup loop would go away.

There's a few things I spotted and didn't address:

  • I think there should be a map of async locks to avoid having the same path overwritten concurrently. Depending on the OS / timing, the current code will either corrupt writes or produce errors, while a lock would hold up (effectively queuing) the request.
  • Temp files should be used for the parts of multipart uploads too to ensure consistency of the parts. My change introduces them only for the final "pasting" step.

@amunra
Copy link
Contributor Author

amunra commented Dec 6, 2023

It turns out using a tempfile here might not be easy.
The reason for that is that the tempfile is only a temp file temporarily, and would need to be converted into a regular file at some point. Additionally, temp files aren't guaranteed to be cleaned up in case of crashes, so I think the PR that got merged is a reasonable compromise taking simplicity into account.
Also see: Stebalien/tempfile#13 (comment)

@amunra
Copy link
Contributor Author

amunra commented Dec 6, 2023

There's a few things I spotted and didn't address:

* I think there should be a map of async locks to avoid having the same path overwritten concurrently. Depending on the OS / timing, the current code will either corrupt writes or produce errors, while a lock would hold up (effectively queuing) the request.

* Temp files should be used for the parts of multipart uploads too to ensure consistency of the parts. My change introduces them only for the final "pasting" step.

I've now raised a second PR (#117) that completes this one addressing the latter of these two outstanding items.
I'm not going to address the locking one as it currently doesn't affect us in our use of s3s-fs.

The last release was in September, any chance you'd be able to cut a new release over the next few days? It would cut down our CI times.

@amunra amunra deleted the upload_via_tmp_file branch December 6, 2023 10:36
@Nugine
Copy link
Owner

Nugine commented Dec 6, 2023

Ok. The next release is scheduled for Dec 9th.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants