Resume (-r) does not work the same with arxiv, as with EUPMC API #158
Comments
@sedimentation-fault at this point I think it's worth saying that you've discovered enough bugs that it's clear we need to rewrite some fundamentals. We knew about some of these internally for some time, but haven't had the resources to resolve them. If you could refrain from reporting new issues for now, that would be helpful - the existing reports have enough detail that we have a good set of goals to test against when rewriting. At this point, more issues just give us more admin work to do afterwards. We really appreciate the detailed analysis so far, though :)
The good thing is, all this discussion helps us to close bugs too, like #157 😉 But, no problem. I understand - I will keep quiet for a while... 😄
Let me break the silence and tell you the solution - there is no point in leaving you to grope in the dark. 😄

### Solution

This bug is actually two:

#### 1. Do not redownload papers if they are already there

**Reason**

I thought this had to do with the '-r' option, but it does not. download.js already checks whether a file with the same name is already there and does not retry the download if there is one. BUT: the check fails for arXiv papers because their IDs contain slashes - and the IDs are not sanitized in the function that does the check.

**Solution**

In download.js, replace
with
#### 2. Do not retry the query if the user passed the '-r' option

**Solution**

In arxiv.js, replace
with
Still in arxiv.js, add:
after
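Again, the extracted page lost the code blocks, so the following is only a sketch of the intended behaviour, with illustrative names (`options.restart`, `shouldQuery`, the cached-results path) that are assumptions, not the real arxiv.js identifiers: when the user asked for a resume, reuse the cached query results instead of hitting the arXiv API again.

```javascript
// Illustrative sketch, not the actual arxiv.js code.
// fsModule is passed in to keep the sketch testable; in real code
// it would just be require('fs').
function shouldQuery(options, resultsPath, fsModule) {
  if (options.restart && fsModule.existsSync(resultsPath)) {
    // Resume run: a cached result list exists, so skip the
    // metadata query entirely and work from the cache.
    return false;
  }
  // Fresh run (or no cache yet): query the API as usual.
  return true;
}
```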
For your convenience, I attach two patches. Inspect them before you apply them - the patch for download.js also contains commented-out and debugging code (I found it informative in this form, YMMV).
Description
While appending '-r' to a getpapers command prevents re-downloading PDFs that are already there for queries against the EUPMC API, the same is NOT the case with the arxiv API: it not only retries the metadata query, but also re-downloads each and every PDF, from the first to the last one.
This is very annoying if you have 24000 PDFs to download, of which a few thousand have already been downloaded in previous attempts.
How to reproduce
Run
getpapers --api 'arxiv' --query 'cat:math.DG' --outdir math.DG -p
Let it download some number of PDFs, then interrupt it and retry with
getpapers --api 'arxiv' --query 'cat:math.DG' --outdir math.DG -p -r
Side effects
An annoying side-effect of this is that the web server will reply with a
416 Range Not Satisfiable
error if the downloader (be it the standard requestretry from the master code, or a curl wrapper from a custom fork, as in #157) attempts a resumable download - i.e. if it tries to resume from where it stopped, which for an already completely downloaded file means a byte range starting at the end of the file, a range that is certainly not satisfiable. If the error is not caught, the script stops at the first file that is already there, with the above 416 error.
This side effect may not occur in the master code (or it may appear as a caught 416 error, I didn't check).
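One way to avoid the 416 on the client side can be sketched as follows - the `expectedSize` parameter and function name are hypothetical, not part of getpapers or requestretry: only send a `Range` header when the partial file is actually smaller than the expected size, and treat an already-complete file as nothing to do.

```javascript
// Hypothetical sketch: decide which headers a resumable download
// should send, given the size of any partial file on disk and the
// expected total size (e.g. from a Content-Length of a HEAD request).
function resumeHeaders(partialSize, expectedSize) {
  if (expectedSize !== null && partialSize >= expectedSize) {
    // File is already complete: requesting bytes past the end
    // would trigger "416 Range Not Satisfiable", so skip it.
    return null;
  }
  if (partialSize > 0) {
    // Genuine partial file: resume from where we stopped.
    return { Range: 'bytes=' + partialSize + '-' };
  }
  // No partial file: plain full download, no Range header.
  return {};
}
```

Alternatively, a 416 received on a resume attempt can simply be caught and interpreted as "file already complete".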