Skip to content

Fix wheel cache concurrency when temp and cache are not on the same filesystem #13541

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

sbidoul
Copy link
Member

@sbidoul sbidoul commented Aug 12, 2025

Avoid concurrency issues when storing wheels into the cache of locally built wheels, when the temporary directory into which the wheel is built is on a different filesystem than the cache directory.

We do this by building the wheel to a temporary subdirectory next to the target then doing an atomic rename to the target destination.

This also improves performance in this situation by avoiding a copy from one filesystem to the other.

Fixes #13540

@sbidoul sbidoul changed the title Fix wheel cache concurrency sbi Fix wheel cache concurrency Aug 12, 2025
@notatallshaw
Copy link
Member

notatallshaw commented Aug 12, 2025

Is this writing temporary files to the cache directory even if the user has explicitly set TEMPDIR, TEMP or TMP to somewhere else?

If so I'm not comfortable with this, while generally having cache directory and temporary directory on the same mount is usually correct, there may be valid reasons users don't want to do that.

In fact I'm not sure why the solution isn't to advise users to set the temporary directory and the cache directory to the same mount?

@sbidoul
Copy link
Member Author

sbidoul commented Aug 12, 2025

Is this writing temporary files to the cache directory even if the user has explicitly set TEMPDIR, TEMP or TMP to somewhere else?

I don't think pip will write temporary files there, so it depends on the build backend. I had a quick look at PEP 517 and it does not seem to say anything about whether the build backend can use ẁheel_directory as a location for temporary files in addition to the resulting built wheel. So I suppose it could? I don't think that is a common behaviour or even good practice for a build backend, though.

I started with an alternative solution which is passing a safe_copy2 function to shutil.move. I can still do that of course at the cost of losing the performance improvement.

In fact I'm not sure why the solution isn't to advise users to set the temporary directory and the cache directory to the same mount?

I am considering this typical use case of a Dockerfile where the pip cache is mounted:

RUN --mount=type=cache,target=/root/.cache/pip,id=pip-distro \
         pip install stuff

In that case I don't think it is possible to force the temp dir to be on the same filesystem as the cache mount.

@notatallshaw
Copy link
Member

notatallshaw commented Aug 12, 2025

In that case I don't think it is possible to force the temp dir to be on the same filesystem as the cache mount.

I can't validate right but my instinct is you should be able to update this to something like:

RUN --mount=type=cache,target=/root/.cache/pip,id=pip-distro \
         TEMPDIR=/root/.cache/pip/tmp pip install stuff

Or set TEMPDIR as an ENV var earlier on the docker build script.

I can try a few approaches later tonight and validate, or invalidate, whether an approach like this works.

@sbidoul
Copy link
Member Author

sbidoul commented Aug 12, 2025

Well, okay that will work, but that sounds like a hack right? And I'm not quite convinced that using the docker cache mount as temp location is good practice.

Unless I am missing something this temp_dir I'm changing here is passed directly as wheel_directory argument to the build backend, and a well behaving build backend should not write temp files there but rather honor TEMPDIR/TEMP/TMP.

@notatallshaw
Copy link
Member

Well, okay that will work, but that sounds like a hack right?

How temp env vars work is inherited from Python's behavior: https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir

I don't think this is a hack? Or at least I've relied on it in the past.

And I'm not quite convinced that using the docker cache mount as temp location is good practice.

Wait, isn't that the purpose of this PR? You want the temp location to be the same as the cache mount? Maybe I got confused what this is doing.

Unless I am missing something this temp_dir I'm changing here is passed directly as wheel_directory argument to the build backend, and a well behaving build backend should not write temp files there but rather honor TEMPDIR/TEMP/TMP.

I'm not clear on this, I will need to read through the code and specs later tonight.

@sbidoul
Copy link
Member Author

sbidoul commented Aug 12, 2025

Sorry I was not clear. Apologies if did sound offensive, that was not my intent.

Setting TEMPDIR is totally fine of course. It's just that using a Docker cache mount as a generic temp location does not look quite right to me.

Wait, isn't that the purpose of this PR? You want the temp location to be the same as the cache mount?

Kind of. But this temp location is used only for the result of the build which is immediately moved to the cache. So it is not for all temporary files created during the build.

I just pushed another commit renaming that variable to make that more explicit.

@sbidoul sbidoul force-pushed the fix-wheel-cache-concurrency-sbi branch from 59f863e to e4b8ef0 Compare August 12, 2025 17:10
@notatallshaw
Copy link
Member

Sorry I was not clear. Apologies if did sound offensive, that was not my intent.

No issues from my side! I didn't perceive any offense.

Kind of. But this temp location is used only for the result of the build which is immediately moved to the cache. So it is not for all temporary files created during the build.

Ahh! This makes way more sense to me now. I'll refrain from reviewing any further until I can read more context and validate it on my own machine.

Avoid concurrency issues when storing wheels into the cache of locally built wheels,
when the temporary build directory is on a different filesystem than the cache directory.

This commit also improves performance in such situations, by directly building the wheel
into the cache filesystem.
@sbidoul sbidoul force-pushed the fix-wheel-cache-concurrency-sbi branch from e4b8ef0 to c44bf1b Compare August 13, 2025 07:47
@sbidoul sbidoul changed the title Fix wheel cache concurrency Fix wheel cache concurrency when temp and cache are not on the same filesystem Aug 13, 2025
@notatallshaw
Copy link
Member

notatallshaw commented Aug 14, 2025

I'm still a bit uneasy on this, this is creating a temporary directory in a cache directory where we weren't before without give the user anyway to override.

Maybe I'm overthinking things, but I'm imagining some user depending on their temporary directory to be ephemeral and their cache directory generally being write only (no deletes). Or the cache directory is regularly synced to an archive, and now those temporary directories could get caught up in that.

Though I agree this is almost certainly a good default, and maybe we see if anyone complains before considering user customization for this.

# We can do a rename here because wheel_path is guaranteed to be
# in the same filesystem as output_dir. An atomic is rename
# to avoid concurrency issues when populating the cache.
os.rename(wheel_path, dest_path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why switch from shutil.move to os.rename? shutil.move uses rename first and falls back to copy: https://docs.python.org/3/library/shutil.html#shutil.move

While the temporary path probably supports os.rename, the cache directory could be on an esoteric filesystem that does not.

Copy link
Member Author

@sbidoul sbidoul Aug 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the cache filesystem does not support os.rename, then letting shutil.move fall back to a copy risks re-creating the concurrency issue I'm fixing here. I don't know if such esoteric file systems exist but perhaps we could address that if someone complains? I can commit to following up on any issue raised about this post release.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair.

For reference this is the PR that changed it from os.rename to shutil.move : #2749

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking it should be os.replace() instead of rename, though.

@sbidoul
Copy link
Member Author

sbidoul commented Aug 14, 2025

I'm still a bit uneasy on this, this is creating a temporary directory in a cache directory where we weren't before without give the user anyway to override.

Maybe I'm overthinking things, but I'm imagining some user depending on their temporary directory to be ephemeral and their cache directory generally being write only (no deletes). Or the cache directory is regularly synced to an archive, and now those temporary directories could get caught up in that.

Though I agree this is almost certainly a good default, and maybe we see if anyone complains before considering user customization for this.

The alternative I initially considered to fix the concurrency issue, was keeping the current temp dir location, and then do a shutil.move(src, dst, copy_function=safe_copy2) where safe_copy2 would copy to a temp file into the cache directory followed by a os.rename. That works but that also creates a temp file in the cache filesystem so I found it nice to improve the performance at the same time. An alternative could be using a lock file but that probably places even more constraints on the capabilities of the cache filesystem and that does not improve performance either. Other approaches I can think of quickly become convoluted.

@notatallshaw
Copy link
Member

The alternative I initially considered to fix the concurrency issue, was keeping the current temp dir location, and then do a shutil.move(src, dst, copy_function=safe_copy2) where safe_copy2 would copy to a temp file into the cache directory followed by a os.rename. That works but that also creates a temp file in the cache filesystem so I found it nice to improve the performance at the same time.

I agree this approach is better

An alternative could be using a lock file but that probably places even more constraints on the capabilities of the cache filesystem. Other approaches I can think of quickly become convoluted.

FWIW uv explicitly supports concurrency, and I believe uses lock files, but I haven't spent any time looking at their design.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Wheel cache concurrency issue when temp dir is not on the same filesystem as the cache dir
2 participants