
Create example scrapers with example results #10

Open · klartext opened this issue Aug 8, 2015 · 19 comments

klartext commented Aug 8, 2015

It would be nice to have real-world example JSON files
together with the directory/file collection that is created by
running a scraper with a certain scraperJSON file.

That would be helpful for implementing a scraper that follows the scraperJSON scheme/policy.

A *.zip or *.tgz file for the results (or the JSON file plus the results) would make sense as examples, IMHO.

petermr (Member) commented Aug 8, 2015

Chris (copied) has written a tutorial on this, which we released last week;
it should provide what you want. Chris, can you point to it and see
if it's what is wanted? Thanks.


blahah (Member) commented Aug 8, 2015

@klartext

blahah changed the title from "RealWorld-Example json-file and dir-with-downloaded-files would be nice" to "Create example scrapers with example results" on Aug 8, 2015
blahah (Member) commented Aug 8, 2015

Ah, I now realise that you perhaps mean that the example scrapers should come with example results - is that the case? If so, that's an excellent idea and I will make it a priority.

klartext (Author) commented Aug 8, 2015

Yes, I meant examples that have three things:

  • the JSON file
  • a real-world example (which paper is downloaded and how, e.g. the example data that is passed via file/stdin/CLI/GUI to select the papers that should be downloaded)
  • real-world results in the form of an example directory with the downloaded content

Just from the JSON files it's not clear how to interpret them.
I see a lot of selector strings like "//meta[@name='citation_publisher']",
but how is that used?
Is this an OPTION to select via CLI/GUI/..., or what does it mean?
So there is some ambiguity in interpreting the JSON files.
Hence a real-world example would help (e.g. element/selector foobar is used with CLI switch foobar(?), the search keywords are e.g. "horizontal._gene._transfer" or so, and this results in a directory with a PDF...).

petermr (Member) commented Aug 8, 2015

@klartext

Our recent workshop is covered in:

https://github.com/ContentMine/workshop-resources/

You will find many of your questions addressed there. I suggest you work
through the getpapers and scraper tutorials and let us know if what you
want is not there.


klartext (Author) commented Aug 9, 2015

@petermr
OK, I read the "getpapers-tutorial.md" from workshop-resources.
It explained a lot and also answered some questions.

But where does scraperJSON fit in?
That is not explained, and I could not find JSON files for getpapers.
Is scraperJSON just a new idea for newer/planned scrapers?
Or is it already used somewhere? And if so: where, and how should it be interpreted?

blahah (Member) commented Aug 9, 2015

@klartext yes, I think we need some overview documentation.

getpapers and quickscrape are both tools you can use to get scientific papers en masse for content mining.

getpapers allows you to search for papers on Europe PubMed Central, arXiv or IEEE. You get the metadata of all the hits for your query. You can optionally also try to download the PDF, XML and/or supplementary data for the hits, but not all papers are downloadable this way.
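For example, a search like `getpapers --query 'dinosaur' --outdir dinosaur -x -p` writes the metadata for every hit into the output directory and tries to fetch the XML and PDF for each one (flags quoted from memory; `getpapers --help` has the authoritative list).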

quickscrape is a web scraping tool. You give it a URL and a scraper definition (in scraperJSON format) and it will scrape the URL using the scraper definition to guide it. We have a collection of scraperJSON definitions for major publishers and journals over at journal-scrapers. quickscrape is useful when you want to download things that (a) were in your getpapers results but getpapers couldn't get the PDF/XML/supp or (b) are not contained in the source databases that getpapers uses.

So, getpapers can get data, very fast, from a subset of the literature. quickscrape can get the same data; it takes more work, but you can theoretically use it on any page on the internet.
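For orientation, here is a rough sketch of what a scraperJSON definition looks like (the field values are illustrative, not a real publisher scraper; see journal-scrapers for real ones). The `url` field is, if I remember the spec right, a regex matched against the page URL:

```json
{
  "url": "examplepress\\.org",
  "elements": {
    "title": {
      "selector": "//meta[@name='citation_title']",
      "attribute": "content"
    },
    "fulltext_pdf": {
      "selector": "//meta[@name='citation_pdf_url']",
      "attribute": "content",
      "download": true
    }
  }
}
```

You would then point quickscrape at an article with something like `quickscrape --url <article-url> --scraper examplepress.json --output out`.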

blahah closed this as completed on Aug 9, 2015
blahah reopened this on Aug 9, 2015
petermr (Member) commented Aug 9, 2015

@klartext as Richard says you need to read about quickscrape. The
workshop tutorial is at
https://github.com/ContentMine/workshop-resources/tree/master/software-tutorials/quickscrape
which should give a reasonable introduction to quickscrape and the format
of scrapers.


klartext (Author) commented Aug 9, 2015

@petermr: OK, I read the quickscrape tutorial.
It explains how the tool works.
But how does one create / interpret the scraperJSON files? (Should I print that question in a loop?)
The link to "create your own definitions" gives me a 404 error.
The link to ctree also gives me a 404 error.

blahah (Member) commented Aug 9, 2015

@klartext how about this: https://github.com/ContentMine/ebi_workshop_20141006/tree/master/sessions/6_scrapers

It's the session on scrapers that I wrote for a workshop last year. It includes guides on creating selectors, as well as basic and advanced scraperJSON.

We need to update some of these resources as there are now more features available, but that should get you started.
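As a quick taste in the meantime (the selector is borrowed from that session's examples; the element name is illustrative): an element like

```json
{
  "license": {
    "selector": "//dl[@class='article-license']//span[@class='license-p']"
  }
}
```

should capture the text of every node the XPath matches; adding an "attribute" key makes it capture an attribute value instead.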

petermr (Member) commented Aug 10, 2015

@klartext,
Thanks for this. Your engagement helps to drive our documentation and also
shows up places where we need to develop software.

> But how does one create / interpret the scraperJSON files?

Probably trivial point: we expect people to use a text editor, probably
starting with a generic scraper template/example. Not sure whether it's
worth developing a specific tool.

If you aren't familiar with XPath see https://en.wikipedia.org/wiki/XPath.
(We use version 1.0). There are also many online tutorials and some are
interactive.
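A one-line worked example: //meta[@name='citation_pdf_url'] means "anywhere in the document, select the <meta> elements whose name attribute equals citation_pdf_url"; the scraper then extracts from each match whatever the element's "attribute" field names.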


klartext (Author) commented:

@blahah thanks, that text helped a lot. I didn't know the XPath stuff in detail, so the selector syntax looked strange to me. After reading it, I saw what the definitions are about.
So: the selector picks tags/data from HTML pages, and what is selected is written in XPath syntax.
Some other things could be explained in the scraperJSON docs, for example
(from the example in the scraperJSON doc) "attribute": "content".
I first guessed this means the part between the opening and closing tag, like "ThisStuff" in "<tag>ThisStuff</tag>", but for a <meta> tag it presumably names the attribute whose value is extracted.
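To make that concrete (purely my own illustration, not taken from the docs): given a hypothetical page containing <meta name="citation_publisher" content="Example Press"/>, I would expect

```json
{
  "publisher": {
    "selector": "//meta[@name='citation_publisher']",
    "attribute": "content"
  }
}
```

to capture "Example Press", i.e. the value of the content attribute of the selected node.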

klartext (Author) commented:

@petermr regarding the documentation: some links go to nirvana, and some pictures are not available. Also, some *.md files are just empty.
Regarding the need for software: what kind of software do you mean?
Which new software need did my comments show up?
Isn't the ContentMine stuff already a working collection of tools?

BTW: I saw one of your presentations. Very interesting that the documents themselves also get analysed.
That is reverse engineering of PDFs. Hardcore! :-)
I have some PDF material that I would like to analyze that way (the papers/archive from the BCL (Biological Computer Laboratory, https://en.wikipedia.org/wiki/Biological_Computer_Laboratory)).
I thought about using tesseract-ocr, but it seems ContentMine already provides a workflow/tools that make it easier.

> Probably trivial point: we expect people to use a text editor, probably starting with a generic scraper template/example. Not sure whether it's worth developing a specific tool.

Well, I already have a tool which is quite generic and can achieve what getpapers and quickscrape offer as separate tools (but I have no JavaScript engine in the background, at least for now).
See here: https://github.com/klartext/any-dl

Regarding XPath: yes, I was not familiar with it; I looked up that Wikipedia article myself after I read Blahah's "02_creating_selectors.md" introduction.
Thanks for pointing me there too. You identified the "missing link" ;-)

From the scraperJSON example:

```json
"selector": "//meta[@name='citation_pdf_url']",
"attribute": "content",
```

This, I think, translates into

```
tagselect( "meta"."name"="citation_pdf_url" | data );
```

in any-dl syntax.

From the "02_creating_selectors.md" doc:

```
//dl[@class='article-license']//span[@class='license-p']
```

would translate in any-dl to:

```
tagselect( "dl"."class"="article-license", "span"."class"="license-p" | ... );
```

where "..." must be substituted with the specification of what to pick out
(e.g. data, arg("foobar"), etc.).
I hope this clarifies why I asked for an interpretation of the JSON files.
But even without having my own tools in mind, I would recommend not only mentioning XPath in the scraperJSON doc (it is mentioned only once), but also adding the link to the Wikipedia article there.

The best explanation of the JSON files was the "02_creating_selectors.md" text.
The scraperJSON description has a lot of links to tools, but no link to "02_creating_selectors.md".
At least for me, the links to the tools were a distraction: when reading about the syntax/format, I want to know more about the format itself; which tools use it does not help so much. (But as an explanation of why scraperJSON was developed it may help, so other people may find it useful.)
The link to "02_creating_selectors.md", on the other hand, would (and did) really help in understanding scraperJSON!

So, I recommend adding a link to "https://github.com/ContentMine/ebi_workshop_20141006/blob/master/sessions/6_scrapers/02_creating_selectors.md"
to the document "https://github.com/ContentMine/scraperJSON", because that explains how scraperJSON "works".
The tools are then examples of how/where scraperJSON is used.
But for people who want to understand the format itself, that is secondary, I think.
(And I think that is not only a programmer's view.)

I hope this possibly too-long answer contributes to enhancing the docs.

P.S.: Real-world examples with results (including e.g. a tar.gz of the result directory), as mentioned at the beginning of the thread, would of course help understanding too. Different people, different ways to learn...
...such results could also be used for testing purposes: a "diff -r" on the directories could check the results of different tools, or of different versions of one tool.
Just an idea...

klartext (Author) commented:

Another unclear point: what is done with all those elements? Not all of them are for download, so where does the information go? Will the scraped information be saved as a JSON file, as metadata?
At first I thought they also had something to do with paper selection.

But now it seems to me that getpapers does the paper selection (give it a search query and you get a URL list), while quickscrape uses that list and just downloads the files.
As only quickscrape uses the scraperJSON definitions, the paper URLs are already known at that point.
So the scraperJSON elements seem not to be used as paper selectors; they are just pieces of information that can be gathered about a paper and can - or will - be saved together with the documents?

An overview doc for orientation would be nice.
A graphic could help a lot, IMHO.
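Roughly, I now picture the pipeline like this (my own sketch; please correct me if it's wrong):
search query -> getpapers -> list of paper URLs + metadata -> quickscrape + scraperJSON definition -> one directory per paper with the downloaded files and the scraped metadata saved alongside.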

petermr (Member) commented Aug 10, 2015

Great - discussions like this are a significant way of taking things
forward...

On Mon, Aug 10, 2015 at 11:01 AM, klartext wrote:

> regarding the documentation: some links go to nirvana, and some pictures are not available. Also, some *.md files are just empty.

I have copied in Chris Kittel who oversees the documentation

> Regarding the need for software: what kind of software do you mean? Isn't the ContentMine stuff already a working collection of tools?

The publisher formats are very variable and in theory we need a new scraper
for each one. In practice much of this is normalised. So today we had a
need to scrape IJSEM, which is an Ingenta journal; that might require new
software, though RSU thinks his is generic enough to cover it. But there is
always the chance that we may need something new.

> BTW: I saw one of your presentations. Very interesting that the documents themselves also get analysed. That is reverse engineering of PDFs. Hardcore! :-)

Certainly hard work!

> I have some PDF material that I would like to analyze that way (the papers/archive from the BCL (Biological Computer Laboratory, https://en.wikipedia.org/wiki/Biological_Computer_Laboratory)).

This is exciting and valuable but challenging. My guess is that much of it
is PDFs of OCR scans. Some of this is probably typewritten (even carbon
copy), some may be print (with hot metal). If this is scanned, the results
are very variable.

> I thought about using tesseract-ocr, but it seems ContentMine already provides a workflow/tools that make it easier.

No, we use Tesseract :-). We are about to measure the OCR error rate. For
born-digital PDFs I am reckoning ca. 1% character error, BUT these do not
suffer from:

  • contrast
  • variability of typeface
  • distortion (often severe)
  • variability of format

So it depends what you want to get from it. I am afraid we normally warn
people that this is very adventurous and will take a lot of their time.

> Regarding XPath: yes, I was not familiar with it; I looked up that Wikipedia article myself after I read Blahah's "02_creating_selectors.md" introduction. Thanks for pointing me there too. You identified the "missing link" ;-)

We have to put that right, @ck

> From the scraperJSON example:
>
>     "selector": "//meta[@name='citation_pdf_url']",
>     "attribute": "content",
>
> This, I think, translates into tagselect( "meta"."name"="citation_pdf_url" | data ); in any-dl syntax. [...]
> I would recommend not only mentioning XPath in the scraperJSON doc (it is mentioned only once), but also adding the link to the Wikipedia article there.

We need a tutorial, ChrisK.

> The best explanation of the JSON files was the "02_creating_selectors.md" text. [...]
> So, I recommend adding a link to "https://github.com/ContentMine/ebi_workshop_20141006/blob/master/sessions/6_scrapers/02_creating_selectors.md" to the document "https://github.com/ContentMine/scraperJSON", because that explains how scraperJSON "works".

> I hope this possibly too-long answer contributes to enhancing the docs.

It's great. We need to know what you and others want, and then cater for
that.

> P.S.: Real-world examples with results (including e.g. a tar.gz of the result directory), as mentioned at the beginning of the thread, would of course help understanding too. [...] such results could also be used for testing purposes: a "diff -r" on the directories could check the results of different tools, or of different versions of one tool.

Yes. As far as possible we try to have test-driven development, where we
check against expected results. Unfortunately the results are so dependent
on the original source that tests become fragile.



chreman commented Aug 10, 2015

@klartext Just now we released a new tutorial on creating scraper definitions; I hope it covers some of your questions. Feedback is highly appreciated!

petermr (Member) commented Aug 10, 2015

Chris Kittel has just posted a draft tutorial on scrapers:

https://github.com/ContentMine/workshop-resources/blob/master/software-tutorials/journal-scrapers/journal-scrapers-tutorial.md

I think it would be very useful for him if you made comments.

P.


klartext (Author) commented:

Hi,

this tutorial is very good.
I wanted to give more detailed feedback, but I am short of time,
so I have not read it completely, only up to "Followables".

It's a good starting point, addressing many questions.

When I have time to read the rest of the document,
I can give more feedback and send my notes.
Some things can be enhanced.

chreman commented Aug 18, 2015

Thank you, we're looking forward to your suggestions.
