[Improvement] pip could resume a package download partway through when the connection is poor #4796
Comments
I don't know how pip's hashing works, but here's some working, simple, modular resume code in a single file/function: https://gist.github.com/CTimmerman/ccf884f8c8dcc284588f1811ed99be6c
I have a poor connection and I often resume pip downloads manually using wget. This is easy for a wheel using wget -c; then you can install the wheel with pip. But when it's a tarball I have to use the setup script, and I don't get the same result, even though in the end it works.
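As a rough illustration of what that manual resuming amounts to, here is a minimal sketch assuming the `requests` library and a hypothetical wheel URL; it asks the server for only the bytes not yet on disk via an HTTP `Range` header, which is essentially what `wget -c` does:

```python
import os
import requests

def resume_download(url: str, dest: str, chunk_size: int = 64 * 1024) -> None:
    """Download `url` to `dest`, continuing from any partial file already present."""
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    headers = {"Range": f"bytes={offset}-"} if offset else {}
    with requests.get(url, headers=headers, stream=True, timeout=30) as resp:
        resp.raise_for_status()
        if offset and resp.status_code != 206:
            # Server ignored the Range header and sent the full body; start over.
            offset = 0
        with open(dest, "ab" if offset else "wb") as f:
            for chunk in resp.iter_content(chunk_size):
                f.write(chunk)

# e.g. resume_download("https://example.com/some_package.whl", "some_package.whl")
```

Once the file is complete, it can be installed with `pip install ./some_package.whl` as described above.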
This should be easier to implement now, since all the logic regarding downloads is isolated in …
Any updates on this? I was installing a huge package (specifically TensorFlow, 500+ MB), and for some reason pip was killed at 99% of the download... I re-ran the command and it started downloading from 0 again...
I have a few questions about the design for this enhancement. First, why (or how) does this happen?
My guess would be that back then the wheels were stored directly in the cache dir instead of being downloaded to a temporary location as they are now. Thus the hashing error should be solved. However, because the wheel being downloaded is in a directory that will be cleaned up afterwards, do we want to expose that mechanism as configurable (e.g. …)?
Same with pytorch, which was 1 GB in size. A day's quota just got exhausted with no fruitful result.
FWIW, you can always curl manually (applying the resuming logic you need and checking the integrity manually) and …
I'd like to give this a try and created a proof-of-concept PR here: #11180. I'm not quite sure what the command line options will look like for this feature. I imagine we will need new options to turn this feature on/off and to limit the number of retries (this is different from the …
If this gets implemented, I would want it to be enabled by default, and to fall back automatically to the previous implementation if resuming is not successful (e.g. if the server does not support resuming). This matches the behaviour of normal download clients, e.g. web browsers.
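As a hedged sketch of what that fallback check could look like (not pip's actual code), a client can probe byte-range support before attempting to resume; servers that honour ranges usually advertise it with an `Accept-Ranges: bytes` header, and a ranged GET that later comes back as 200 instead of 206 also signals that a fresh full download is needed:

```python
from typing import Optional

import requests

def server_supports_resume(url: str, session: Optional[requests.Session] = None) -> bool:
    """Best-effort probe for byte-range support via the Accept-Ranges header.

    Absence of the header is not conclusive, so callers should still fall back
    to a full download when a ranged GET returns 200 instead of 206.
    """
    session = session or requests.Session()
    resp = session.head(url, allow_redirects=True, timeout=10)
    return resp.headers.get("Accept-Ranges", "").lower() == "bytes"
```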
How about the number of attempts? Should we keep making new requests as long as the responses have a successful status code (e.g. 200) and non-empty bodies (i.e. some progress is made in each request)?
Instead of trying to guess how many attempts are reasonable, perhaps pip should store the incomplete download somewhere (e.g. in the cache?) and resume it on the next …
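One possible shape for that idea, as a hedged sketch with an assumed cache layout (not pip's real one): keep partial files in a dedicated directory keyed by a hash of the download URL, so a later invocation can locate them and request only the remaining bytes.

```python
import hashlib
import os

# Hypothetical location for partial downloads; pip's real cache layout differs.
INCOMPLETE_DIR = os.path.expanduser("~/.cache/pip/incomplete")

def partial_path(url: str) -> str:
    """Where the partial download for `url` would be stored between runs."""
    os.makedirs(INCOMPLETE_DIR, exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(INCOMPLETE_DIR, key)

def resume_offset(url: str) -> int:
    """Bytes already fetched for `url` in a previous run (0 if none)."""
    path = partial_path(url)
    return os.path.getsize(path) if os.path.exists(path) else 0
```

A later run would use `resume_offset(url)` to build the `Range` header, and only move the file out of this directory once it is complete and its hash has been verified.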
Currently pip uses CacheControl to handle HTTP caching, but it doesn't cache responses with incomplete bodies (or …). Also, I'm not sure the browser behavior is desirable in this case. With large wheels (e.g. pytorch > 2 GB) and my crappy Internet, it consistently fails 4-5 times before completing. If users are installing many large packages (e.g. from a requirements.txt), having to manually resume multiple times can be annoying. That's why I think opt-in might work better. In most cases resuming is not required, but in the cases where it is, we can present a warning informing users that 1) the download is incomplete, and 2) they can use a command line option to automatically resume the download next time.
One caveat with trying to mimic the browser is that, unlike the browser's UI, which lets the user cancel / pause / resume any specific download, pip doesn't have such a rich user interface via the CLI. We'd need to provide at least one knob for this resuming behaviour -- either to opt in or to opt out. I think when you're not in "resume my downloads" mode, pip should also clean up any existing incomplete downloads. That said, picking between opt-in vs opt-out is not really a blocker to implementing either behaviour. It's a matter of changing a flag's default value in the PR (let's use a flag with values like …
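As a purely hypothetical illustration of such a knob (the flag name and values below are made up and are not pip's actual options), a single flag whose allowed values name the two behaviours, and whose default decides opt-in vs opt-out, might be defined like this:

```python
import argparse

parser = argparse.ArgumentParser(prog="pip install")
parser.add_argument(
    "--incomplete-downloads",          # hypothetical flag name for illustration
    choices=("resume", "discard"),
    default="discard",                 # flipping this default flips opt-in vs opt-out
    help="Whether to resume a previously interrupted download or start fresh "
         "and clean up any leftover partial files.",
)

args = parser.parse_args(["--incomplete-downloads", "resume"])
print(args.incomplete_downloads)  # -> "resume"
```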
I think my PR #11180 is ready for a first round of review. Suggestions for more meaningful flag names, log messages, and exception messages are welcome.
Having the same problem downloading pytorch + OpenCV on a Streamlit project for the third time today (connection lost after 6 hours...), I wonder if making pip able to use an external downloader could be a thing? yt-dl provides:

    --external-downloader COMMAND    Use the specified external downloader.
                                     Currently supports aria2c, avconv, axel,
                                     curl, ffmpeg, httpie, wget
    --external-downloader-args ARGS  Give these arguments to the external
                                     downloader

A kind of …
Which of those also work on Windows? Resuming HTTP downloads is simple, as evidenced by the PR at #11180, which is fine by me, but I feel it's such a basic feature that it should be supported upstream.
We're not going to be using an external programme for network interaction within pip. This should be implemented as logic within pip itself.
What's the progress on this feature? It's annoying trying to install packages like tensorflow and pytorch and then getting errors when the downloads are almost complete.
I have a proof-of-concept PR here: #11180. It's been a while since I last worked on it, and there has been some discussion about the user interface that I haven't incorporated into the PR. Personally I feel like the major problems are:
I think it would be nice if we could get some input from the maintainers, e.g., priorities, expectations, etc.
By upstream do you mean requests? As for which UX to use, I don't think anyone has really expressed strong opinions; people have only pointed out things the end product needs to handle. So the best approach to drive this forward would be to implement what you feel is best and see what people think of it.
Sounds good. I'll update that PR when I get time (I've been busy lately).
It's 2024 and there is still no resume for large packages. The connection gets closed by the server and I have to start numpy and pyspark over and over again. A resume would save a lot of resources, as pip otherwise retrieves the same stream all over again.
Yes, very necessary.
Currently in the western Amazon on a Starlink connection, trying to download birdnetlib, and this is killing me. It would be so much better to use the rsync protocol with checksums.
Hello everyone 👋 I opened a PR for this one (#12991). Happy to hear your thoughts and finally get it merged!
Description
When I have a poor internet connection (the network cuts out unexpectedly), updating a pip package is painful. When I retry the pip install, it stops at the midpoint and gives me the same md5 error.
All I have to do is retry the install over and over until the whole download happens to complete.
If pip's download had a resume feature, this problem could be solved.
What I've run
pip install -U jupyterlab
in poor network conditions.