
Resume incomplete download #11180

Open · wants to merge 2 commits into main from feature-resume-download

Conversation

@yichi-yang commented Jun 10, 2022

Overview

This PR adds a feature that resumes a download if the downloaded file is incomplete (e.g. when the Internet connection is poor). More specifically, if:

  1. The initial response includes a Content-Length header,
  2. The server sends fewer bytes than indicated in the Content-Length header,
  3. The user enables the auto resumption feature using --incomplete-downloads=resume,

the downloader will make new requests and attempt to resume download using a Range header. If the initial response includes an ETag (preferred) or Date header, the downloader will ask the server to resume download only when it is safe (i.e., the file hasn't changed since the initial request) using an If-Range header.

If the server responds with a 200 (e.g. if the server doesn't support partial content or can't check if the file has changed), the downloader will restart the download (i.e. start from the very first byte); if the server responds with a 206 Partial Content, the downloader will resume the download from the partially downloaded file.

Note if the server always responds with 200, the downloader can potentially get stuck and waste unreasonable amounts of bandwidth downloading the first few bytes over and over again. Therefore, a retry limit is introduced to avoid this case.

If not enough bytes are received and auto resumption is disabled or the retry limit is exceeded, the downloader will clean up the incomplete file and fail with an exception.
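The conditional-resume logic above can be sketched as a small helper. `build_resume_headers` is a hypothetical name used for illustration, not code from this PR:

```python
from typing import Optional


def build_resume_headers(
    bytes_received: int, etag: Optional[str], date: Optional[str]
) -> dict:
    """Build headers for a resume request, preferring ETag over Date
    as the If-Range validator, as described above."""
    # Ask the server for everything after the bytes we already have.
    headers = {"Range": f"bytes={bytes_received}-"}
    # If-Range makes the Range header conditional: the server sends a
    # 206 partial response only if the validator still matches, and a
    # full 200 response otherwise.
    if etag is not None:
        headers["If-Range"] = etag
    elif date is not None:
        headers["If-Range"] = date
    return headers
```

With no validator available, the request is an unconditional Range request, and the client must be prepared for either a 206 or a 200.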

Flags

To control the auto resumption behavior, two new flags are added:

  • --incomplete-downloads={resume,discard} controls whether the auto resumption feature is enabled (defaults to discard);
  • --incomplete-download-retries limits the maximum number of retries (defaults to 5).
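For illustration, the two flags above might be declared as follows. This is a sketch only: the names and defaults mirror the PR description, but pip's real option wiring (which is built on optparse) differs:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--incomplete-downloads",
    choices=["resume", "discard"],
    default="discard",
    help="What to do when a download is incomplete (default: discard).",
)
parser.add_argument(
    "--incomplete-download-retries",
    type=int,
    default=5,
    help="Maximum number of resume attempts (default: 5).",
)
```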

Towards #4796

@yichi-yang yichi-yang force-pushed the feature-resume-download branch 3 times, most recently from 25cab52 to c869789 Compare July 17, 2022 01:25
@yichi-yang yichi-yang marked this pull request as ready for review July 17, 2022 04:25
Comment on lines 229 to 245
if total_length is not None and bytes_received < total_length:
    if self._resume_incomplete:
        logger.critical(
            "Failed to download %s after %d resumption attempts.",
            link,
            self._resume_attempts,
        )
    else:
        logger.critical(
            "Failed to download %s."
            " Set --incomplete-downloads=resume to automatically"
            " resume incomplete download.",
            link,
        )
    os.remove(filepath)
    raise RuntimeError("Incomplete download")
@yichi-yang (Author) commented Jul 17, 2022
Not quite sure what to do here. I don't think throwing an exception (L244) and printing a long (and useless) stack trace is user-friendly, but I can't think of a better alternative. Maybe we should just reuse the same log messages above so it's at least helpful?
I think throwing a (subclassed) DiagnosticPipError here might be a good idea. We can let the user know that:

  1. the download is incomplete
  2. the incomplete file has been cleaned up
  3. they can use --incomplete-downloads=resume to enable the feature if they haven't already
  4. they can modify the retry limit with --incomplete-download-retries.

class DiagnosticPipError(PipError):
    """An error, that presents diagnostic information to the user.

    This contains a bunch of logic, to enable pretty presentation of our error
    messages. Each error gets a unique reference. Each error can also include
    additional context, a hint and/or a note -- which are presented with the
    main error message in a consistent style.

    This is adapted from the error output styling in `sphinx-theme-builder`.
    """
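As a rough, self-contained sketch of that idea (the `IncompleteDownloadError` name is hypothetical, and this deliberately omits pip's real DiagnosticPipError machinery for reference codes and rich rendering):

```python
class IncompleteDownloadError(Exception):
    """Sketch of a diagnostic-style error for incomplete downloads."""

    def __init__(self, link: str, resume_attempts: int) -> None:
        self.reference = "incomplete-download"
        message = (
            f"Download of {link} is incomplete after "
            f"{resume_attempts} resumption attempts; the partial "
            "file has been removed."
        )
        # A hint, presented alongside the main message, telling the
        # user which flags control the behaviour.
        self.hint = (
            "Use --incomplete-downloads=resume to enable automatic "
            "resumption, and --incomplete-download-retries to raise "
            "the retry limit."
        )
        super().__init__(message)
```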

@yichi-yang (Author) commented Jul 17, 2022

Implemented the above in a separate commit: a082517.

@yichi-yang yichi-yang force-pushed the feature-resume-download branch from c869789 to d655669 Compare July 17, 2022 17:22
@yichi-yang yichi-yang force-pushed the feature-resume-download branch from 93b9d14 to a082517 Compare July 17, 2022 19:06
@uranusjr (Member)

I saw that a previous version also queries Accept-Ranges to determine whether to go into using Range for resuming. Was it removed because it’s considered not a good idea?

@uranusjr (Member)

I wonder (discussions needed!) if we should aggressively set the default to resume. The logic seems safe enough and without much overhead (and the worst-case scenario can only be caused by an incorrect server implementation) that we should make the benefit opt-out instead of opt-in.

@yichi-yang (Author) commented Jul 17, 2022

I saw that a previous version also queries Accept-Ranges to determine whether to go into using Range for resuming. Was it removed because it’s considered not a good idea?

As per RFC 7233 servers that don't support Range will simply ignore it and respond with 200 as if it is a normal GET request. Additionally, if the server supports Range but the document has changed since the initial request, we need to somehow figure that out and restart instead of resume (that's what the If-Range header does).

So to answer your question, checking Accept-Ranges is kind of redundant here (it's up to the server to decide if a partial response is appropriate). I don't think we should check Accept-Ranges unless there's a very good reason to do so.
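Under those RFC 7233 semantics, the client-side decision reduces to the response status code; a hypothetical helper:

```python
def next_action(status_code: int) -> str:
    """Decide how to continue after a resume request, per the
    RFC 7233 behaviour described above."""
    if status_code == 206:
        # Partial Content: append the new bytes to the partial file.
        return "append"
    if status_code == 200:
        # Full response: either the server ignored Range (no support)
        # or If-Range indicated the file changed; restart from byte 0.
        return "restart"
    raise ValueError(f"unexpected status code {status_code}")
```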

@yichi-yang (Author)

I think it might make more sense if we only have a single --incomplete-download-retries flag: setting it to 0 or a negative number would disable the feature. This is more similar to how --retries currently works.

@uranusjr (Member)

Ah yes that makes sense (perhaps with a shorter option name though)

@yichi-yang (Author)

(perhaps with a shorter option name though)

Suggestions are welcome :)
Though we need to make sure users won't confuse this with --retries.

@CTimmerman commented Jul 19, 2022

Just use --retries. It's the same as this with 0 bytes.

@yichi-yang (Author)

Just use --retries. It's the same as this with 0 bytes.

Could you elaborate? I don't think --retries on its own solves the issue #4796 here.

@uranusjr (Member)

I think it’s “just use --retries to control resuming”, in the sense that right now a retry is conceptually the same as resuming from a previous download of zero bytes.

@yichi-yang (Author)

I think it’s “just use --retries to control resuming”, in the sense that right now a retry is conceptually the same as resuming from a previous download of zero bytes.

Ah I see. Our resume function is conceptually similar to --retries, but implementation-wise --retries is handled by requests/urllib3. I don't really know a good way to combine the two. Consider the case where a user sets --retries 5: does it mean requests can make 5 retries AND pip can try resuming 5 times (kind of weird), or that requests and pip can retry a total of 5 times (hard to implement, since retry would now mean two different things)?

On a side note, part of the problem should get fixed upstream soon (psf/requests#4956, psf/requests#6092), but that doesn't change the fact that pip has to make additional partial requests to resume.
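The layering issue can be illustrated without any network code. In the sketch below, `fetch` is a hypothetical stand-in for an HTTP GET whose connection retries would be handled internally by requests/urllib3, while the resume loop counts its attempts separately:

```python
def download_with_resume(fetch, total_length, max_resume_attempts=5):
    """Sketch of a pip-level resume loop.

    `fetch(offset)` returns the bytes received starting at `offset`;
    in real code each call could internally retry connections per the
    requests/urllib3 --retries policy, independent of this counter.
    """
    received = fetch(0)
    attempts = 0
    # Resume attempts are counted here, at the download level, not
    # inside the transport layer.
    while len(received) < total_length and attempts < max_resume_attempts:
        attempts += 1
        received += fetch(len(received))
    return received, attempts
```

The point of the sketch is that the two counters live at different layers, which is why folding resumes into the existing --retries value is awkward.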

@uranusjr (Member)

Yeah that’s the problem, we can’t really use the same counter between connection retries and resumes. But I’m guessing it’s not that big a problem and we could intentionally implement the wrong behaviour and find a way to fix that later… For example for retries=5 we would do 5 connection retries and 5 resume retries, and most of the time the user wouldn’t tell the difference, since it’s relatively rare to see a connection fails halfway, and when resuming starting to fail to connect. And even in that case users might be fine to do a few more retries, who knows. I think I’d prefer to get the user interface (CLI flags) right, and “fix” the relatively minor behaviour later (if ever).

@yichi-yang (Author)

Yeah that’s the problem, we can’t really use the same counter between connection retries and resumes. But I’m guessing it’s not that big a problem and we could intentionally implement the wrong behaviour and find a way to fix that later… For example for retries=5 we would do 5 connection retries and 5 resume retries, and most of the time the user wouldn’t tell the difference, since it’s relatively rare to see a connection fails halfway, and when resuming starting to fail to connect. And even in that case users might be fine to do a few more retries, who knows. I think I’d prefer to get the user interface (CLI flags) right, and “fix” the relatively minor behaviour later (if ever).

Good point. But that's an easy change so I think we can wait a bit and see what others think.

@CTimmerman commented Jul 23, 2022

I think the default 5 connection retries should include resuming, and that resuming should not start from 0 by default, but would not mind reusing the user's --retries value if the actual one is not available. Resuming should be an upstream feature if retrying is.

@horacehoff
Imo the download shouldn't resume by starting from 0. Let it resume using the already-partially-downloaded file.

@yichi-yang (Author) commented Sep 2, 2022

Imo the download shouldn't resume by starting from 0. Let it resume using the already-partially-downloaded file.

The implementation in this PR resumes from partially downloaded file when possible. There are cases where resuming is not possible, e.g. when the file has changed on the server after we started the download, and we have to start downloading from scratch again.

Is that the behavior you want?

@horacehoff
Oh right, sorry. I misunderstood exactly what this PR does, and from this new point of view I have nothing to say, as it sounds very useful to many people who have poor/low bandwidth.

@JD91B commented Dec 14, 2022

I was having problems not being able to download large files with pip because of my slow internet, but this helped so much. Thank you, it was really useful.

@github-actions github-actions bot added the needs rebase or merge PR has conflicts with current master label Jan 18, 2023
6 participants