Corrupt PDFs from EuropePMC #166
Comments
I have reproduced this with arXiv: |
Confirmed on node 8.5.0 using Okular as PDF reader. However, it seems to work fine for me with getpapers 0.4.14. |
The second thing gives a 403 message concerning the user agent, confirmed on 0.4.14 as well. Edit: Confirmed with ModHeader that
into
so I might have only blacklisted myself. |
I had a similar issue with PDFs from an arXiv search (I'm running getpapers 0.4.17 on a Windows 10.0.19041 machine).
When I opened the downloaded PDF file in Notepad++ I got this:
Hope this information helps solve the issue :) |
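A quick way to tell whether a download like this is a real PDF or an error page saved under a .pdf name is to inspect its first bytes. The sketch below is in Python and is not part of getpapers; the names `classify_download` and `looks_like_pdf` are hypothetical, and the HTML check is a rough heuristic that assumes the server returned an HTML error page (for example, for the 403 mentioned above).

```python
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Strict check: a well-formed PDF starts with the '%PDF-' marker."""
    return data.startswith(PDF_MAGIC)


def classify_download(data: bytes) -> str:
    """Return 'pdf', 'html', or 'unknown' for the raw bytes of a download.

    Hypothetical helper: the 'html' branch is a heuristic that looks for
    common HTML openers in the first kilobyte of the payload.
    """
    if looks_like_pdf(data):
        return "pdf"
    head = data[:1024].lower()
    if b"<!doctype html" in head or b"<html" in head:
        return "html"
    return "unknown"


if __name__ == "__main__":
    import sys
    from pathlib import Path

    # Usage: python check_pdf.py file1.pdf file2.pdf ...
    for name in sys.argv[1:]:
        print(name, classify_download(Path(name).read_bytes()))
```

A result of `html` would suggest the request was refused and the error page was written to disk. A result of `pdf` only confirms the header, so a file with blank pages would still pass this check.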
We (@ayushgarg) are writing |
I'm thinking about what to do. Some time ago we talked with Paul Ginsparg and it
was OK, but later the header got banned. Downloading huge chunks of arXiv
through FTP dumps is no use to us. I haven't revisited this problem since
we don't currently use arXiv in our projects.
Ayush, can you think about the technical processes involved
and we'll discuss it on Slack or at a catch-up?
On Fri, Mar 5, 2021 at 5:43 PM Silvia Gutiérrez ***@***.***> wrote:
Thanks for the quick reply @petermr <https://github.com/petermr>!
It may or may not have something to do with what they say here in the
"Play nice" section of this text <https://arxiv.org/help/bulk_data> i.e.
on how harvesting should be done using the dedicated site (
export.arxiv.org) and how one should use the 1 second sleep per burst?
--
Peter Murray-Rust
Founder ContentMine.org
and
Reader Emeritus in Molecular Informatics
Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
|
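The "play nice" suggestion quoted above (use export.arxiv.org and pause between requests) could be sketched roughly as follows. This is Python, not getpapers code, and it rests on several assumptions: the function names are hypothetical, the User-Agent string is whatever the caller supplies, it is assumed that the same paths are served from export.arxiv.org, and the sketch sleeps before every request after the first, which is more conservative than the "1 second sleep per burst" mentioned in the quote.

```python
import time
import urllib.request
from urllib.parse import urlsplit, urlunsplit

EXPORT_HOST = "export.arxiv.org"


def to_export_host(url: str) -> str:
    """Rewrite arxiv.org URLs to the harvesting host; leave others untouched."""
    parts = urlsplit(url)
    if parts.hostname in ("arxiv.org", "www.arxiv.org"):
        parts = parts._replace(netloc=EXPORT_HOST)
    return urlunsplit(parts)


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build a request that identifies the client with an explicit User-Agent."""
    return urllib.request.Request(
        to_export_host(url), headers={"User-Agent": user_agent}
    )


def fetch_bytes(request: urllib.request.Request) -> bytes:
    """Perform the HTTP request and return the response body."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def download_all(urls, user_agent, delay_seconds=1.0,
                 fetch=fetch_bytes, sleep=time.sleep):
    """Fetch each URL in turn, sleeping between consecutive requests.

    `fetch` and `sleep` are injectable so the throttling logic can be
    exercised without touching the network.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause before every request after the first
        results[url] = fetch(build_request(url, user_agent))
    return results
```

Calling `download_all(urls, "my-tool/0.1 (contact address)")` with a list of PDF URLs would return a dict of raw bytes keyed by the original URL. Whether a given User-Agent is accepted is up to the server.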
I am running getpapers and every PDF download is corrupt. None of the PDFs can be read by Adobe Reader or PDFDebugger (PDFBox). The structure of the document appears to be correct (it identifies pages, but they are visually blank and have virtually no content).
I have uninstalled and then reinstalled getpapers. Running on Mac OS X 10.9.5.