Corrupt PDFs from EuropePMC #166

petermr · 2017-09-24T13:36:05Z

I am running getpapers and every PDF download is corrupt.

localhost:workspace pm286$ node --version
v6.2.1
localhost:workspace pm286$ getpapers --version
0.4.12
localhost:workspace pm286$ getpapers -p -x -q "systematic review" -k 200 -o systematic

None of the PDFs open in Adobe Reader or PDFDebugger (PDFBox). The structure of each document appears to be correct (pages are identified), but the pages are visually blank and have virtually no content.

I have uninstalled and then reinstalled getpapers.

Running on Mac OS X 10.9.5.

petermr · 2017-09-24T13:40:50Z

I have repeated this with arXiv (--api arxiv -p). This gives PDFs that are uniformly 2 KB in size. This is clearly wrong, but I don't know what the correct result should be.

larsgw · 2017-09-24T14:23:33Z

Confirmed on node 8.5.0 using Okular as PDF reader. However, it seems to work fine for me with getpapers 0.4.14.

larsgw · 2017-09-24T14:39:24Z

The second case (arXiv) returns a 403 response concerning the user agent; confirmed on 0.4.14 as well.

Edit: Confirmed with ModHeader that the User-Agent is the problem. See https://arxiv.org/denied.html and https://arxiv.org/help/robots. Using User-Agent: getpapers/TDM worked once, but after I changed the header in config.js to that value, it broke too (some sort of blacklist, I guess). The error page also changed, however, from

Sadly, your client "getpapers/(TDM Crawler [email protected])" violates the automated access guidelines posted at arxiv.org, and is consequently excluded.

to

Sadly, you do not currently appear to have permission to access
https://arxiv.org/pdf/0710.0054v1.pdf

so I might only have blacklisted myself.
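
For anyone who wants to reproduce the User-Agent check outside a browser extension, here is a minimal Python sketch. It only builds a request with an explicit User-Agent header and interprets the status code; it does not contact arXiv when run. The `EXAMPLE_USER_AGENT` string and both function names are hypothetical, chosen for illustration, and are not what getpapers sends or uses.

```python
import urllib.request

# Hypothetical identifier for illustration only; not the getpapers User-Agent.
EXAMPLE_USER_AGENT = "example-tdm-client/0.1 (+https://example.org/contact)"


def build_request(url, user_agent=EXAMPLE_USER_AGENT):
    """Return a urllib Request that sends an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def describe_status(status):
    """Map an HTTP status code to a short diagnosis for a PDF download."""
    if status == 200:
        return "ok"
    if status == 403:
        return "denied (check the User-Agent against the access guidelines)"
    return "unexpected status %d" % status


if __name__ == "__main__":
    request = build_request("https://arxiv.org/pdf/0710.0054v1.pdf")
    # urllib stores header names capitalised as "User-agent".
    print(request.get_header("User-agent"))
    print(describe_status(403))
```

Passing the request to `urllib.request.urlopen` would perform the actual download; a 403 is raised there as `urllib.error.HTTPError`, whose `code` attribute can be handed to `describe_status`.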

silviaegt · 2021-03-05T16:11:34Z

I had a similar issue with PDFs from an arXiv search (I'm running getpapers 0.4.17 on a Windows 10.0.19041 machine).
I used the following query:

getpapers --api "arxiv" --query "abs:wikidata" --pdf --outdir wikidataarticles --limit 10

When I opened the downloaded PDF file in Notepad++, I got this:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head><title>403 Forbidden</title></head>
<body>
<h1>Access Denied</h1>

 <p>Sadly, your client "<b>getpapers/(TDM Crawler [email protected])</b>" violates
 the automated access guidelines posted at arxiv.org,
 and is consequently excluded.</p>

I hope this information helps to solve the issue :)

petermr · 2021-03-05T17:18:02Z

arXiv doesn't like crawlers. I think it was OK 5 years ago. It banned the crawler by the email address in its User-Agent. I am not sure how they would want it to be mined; they have a raw file download dump, I think.
The files aren't corrupt; they have been replaced by an HTML error page.
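
A quick way to tell the two cases apart is to look at the first bytes of each downloaded file: a real PDF starts with the signature `%PDF-`, whereas the arXiv error page starts with an HTML doctype. The sketch below assumes nothing about getpapers itself; `classify_download` and `classify_file` are hypothetical helper names.

```python
def classify_download(data):
    """Classify the leading bytes of a download as 'pdf', 'html' or 'unknown'."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    # Error pages may begin with whitespace and use any letter case.
    head = data[:1024].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    return "unknown"


def classify_file(path):
    """Read the first 1024 bytes of the file at path and classify them."""
    with open(path, "rb") as handle:
        return classify_download(handle.read(1024))


if __name__ == "__main__":
    print(classify_download(b"%PDF-1.4\n"))
    print(classify_download(b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01'))
```

Running `classify_file` over every `.pdf` in the output directory would show at a glance which downloads are really HTML error pages.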

We (@ayushgarg) are writing pygetpapers and will need to think about this. If there is enough demand, we will go back to arXiv and work out an agreed solution.

silviaegt · 2021-03-05T17:42:55Z

Thanks for the quick reply @petermr!
It may or may not have something to do with what they say in the "Play nice" section of this text (https://arxiv.org/help/bulk_data), i.e. that harvesting should be done using the dedicated site (export.arxiv.org) and that one should sleep for 1 second per burst?
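
To make the idea concrete, here is a rough Python sketch of what "playing nice" might look like on the client side: rewrite arxiv.org URLs to the export.arxiv.org host and pause for one second after each burst of requests. The names `to_export_url` and `BurstThrottle` are hypothetical, and the default burst size of 4 is an assumed placeholder, since the thread only mentions "1 second sleep per burst"; the arXiv guidelines should be checked for the actual limits.

```python
import time
from urllib.parse import urlsplit, urlunsplit

EXPORT_HOST = "export.arxiv.org"


def to_export_url(url):
    """Point an arxiv.org URL at the export host; leave other URLs unchanged."""
    parts = urlsplit(url)
    if parts.netloc.lower() in ("arxiv.org", "www.arxiv.org"):
        parts = parts._replace(netloc=EXPORT_HOST)
    return urlunsplit(parts)


class BurstThrottle:
    """Pause after every full burst of requests.

    burst_size=4 is an assumed placeholder, not a documented arXiv limit.
    """

    def __init__(self, burst_size=4, pause_seconds=1.0, sleep=time.sleep):
        self.burst_size = burst_size
        self.pause_seconds = pause_seconds
        self._sleep = sleep  # injectable so the pause can be stubbed out
        self._count = 0

    def wait(self):
        """Call before each request; sleeps once a full burst has been sent."""
        if self._count and self._count % self.burst_size == 0:
            self._sleep(self.pause_seconds)
        self._count += 1


if __name__ == "__main__":
    throttle = BurstThrottle()
    for url in ["https://arxiv.org/pdf/0710.0054v1.pdf"]:
        throttle.wait()
        print(to_export_url(url))  # a real client would download here
```

Whether a throttled client with a different User-Agent would actually be accepted is a separate question that this sketch does not answer.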

petermr · 2021-03-05T19:10:28Z

I'm thinking about what to do. Some time ago we talked with Paul Ginsparg and it was OK, but later the header got banned. Downloading huge chunks of arXiv through FTP dumps is no use to us. I haven't revisited this problem since we don't currently use arXiv in our projects. Ayush, can you think about the technical processes involved, and we'll discuss it on Slack or at a catch-up?

-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

larsgw mentioned this issue Oct 18, 2017

arXiv.org PDFs denying access to User-Agent "getpapers/(TDM Crawler [email protected])" #167

Open
