Create example scrapers with example results #10
Chris (copied) has written a tutorial on this, which we released last week.
Peter Murray-Rust
Ah, I now realise that you perhaps mean that the example scrapers should come with example results - is that the case? If so, that's an excellent idea and I will make it a priority.
Yes, I meant examples that have three things:
Just from the JSON files it's not clear how to interpret them.
Our recent workshop is covered in: https://github.com/ContentMine/workshop-resources/ You will find many of your questions addressed there. Suggest you work through these.
@petermr But where does scraperJSON fit in?
@klartext yes, I think we need some overview documentation.

getpapers and quickscrape are both tools you can use to get scientific papers en masse for content mining. getpapers allows you to search for papers on EuropePubMed, ArXiv or IEEE. You get metadata for all the hits to your query. You can optionally also try to download PDF, XML and/or supplementary data for the hits, but not all papers are downloadable this way.

quickscrape is a web scraping tool. You give it a URL and a scraper definition (in scraperJSON format) and it will scrape the URL using the scraper definition to guide it. We have a collection of scraperJSON definitions for major publishers and journals over at journal-scrapers. quickscrape is useful when you want to download things that (a) were in your getpapers results but getpapers couldn't get the PDF/XML/supp, or (b) are not contained in the source databases that getpapers uses.

So, getpapers can get data, very fast, from a subset of the literature. quickscrape can get the same data; it takes more work, but you can theoretically use it on any page on the internet.
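The selector-driven workflow Richard describes can be sketched in a few lines. This is NOT quickscrape's actual code (quickscrape is a Node tool); it is a minimal stdlib-Python illustration of the idea that a scraperJSON-style definition maps element names to selectors, which the tool then applies to a fetched page. The definition, page, and element names below are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up minimal scraperJSON-style definition (not a real journal scraper).
# Real scraperJSON uses full XPath; ElementTree only supports a subset,
# so attribute extraction is expressed as a separate "attribute" key here.
scraper_definition = {
    "elements": {
        "title":        {"selector": ".//h1[@class='article-title']"},
        "fulltext_pdf": {"selector": ".//a[@id='pdf-link']", "attribute": "href"},
    }
}

def scrape(page_xml, definition):
    """Apply each element's selector to the page; return text or an attribute."""
    tree = ET.fromstring(page_xml)
    results = {}
    for name, element in definition["elements"].items():
        node = tree.find(element["selector"])
        if node is None:
            results[name] = None
        elif "attribute" in element:
            results[name] = node.get(element["attribute"])
        else:
            results[name] = node.text
    return results

page = """<html><body>
  <h1 class="article-title">Content mining for everyone</h1>
  <a id="pdf-link" href="/papers/123.pdf">Download PDF</a>
</body></html>"""

print(scrape(page, scraper_definition))
```

The point is that the scraper definition is pure data: the same engine runs against any site, and only the JSON changes per publisher.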
@klartext as Richard says, you need to read about it.
@petermr: OK, I read the quickscrape tutorial.
@klartext how about this: https://github.com/ContentMine/ebi_workshop_20141006/tree/master/sessions/6_scrapers It's the session on scrapers that I wrote for a workshop last year. It includes guides on creating selectors, as well as basic and advanced scraperJSON. We need to update some of these resources as there are now more features available, but that should get you started.
@klartext, "But how to create / interpret the jsonSCRAPER-files?" If you aren't familiar with XPath, see https://en.wikipedia.org/wiki/XPath.
@blahah thanks, that text did help a lot. I didn't know the XPath stuff in detail, so the selector syntax looked strange to me. After reading that, I saw what the definitions are about.
@petermr regarding the documentation: some links are dead, and some pictures are not available. Also, some *.md files are just empty. BTW: I saw one of your presentations. Very interesting that the documents will also get analysed.
Well, I already have a tool which is quite generic and can achieve what getpapers and quickscrape offer as separate tools (but I have no JavaScript engine in the background, at least for now).

Regarding XPath: yes, I was not familiar with it; I looked up the Wikipedia article myself after I read Blahah's "02_creating_selectors.md" introduction. From the scraperJSON example: this, I think, translates into the form given in the "02_creating_selectors.md" doc, where "..." must be substituted with the specification of what to pick out.

Hope this clarifies why I asked for interpretation of the JSON files. The best explanation of the JSON files was in the "02_creating_selectors.md" text, so I recommend adding a link to "https://github.com/ContentMine/ebi_workshop_20141006/blob/master/sessions/6_scrapers/02_creating_selectors.md".

I hope this possibly too-long answer contributes to enhancing the docs.

P.S.: Real-world examples with results (including e.g. a tar.gz of the result directory), as mentioned at the beginning of the thread, would of course help understanding too. Different people, different ways to learn...
Another unclear point: what is done with all those elements? Not all are for download, so where does the information go? Will the scraped information be saved as a JSON file, as metadata? But now it seems to me that getpapers does paper selection (give it a search query and you get a URL list), while quickscrape uses that list and just downloads the files. An overview doc for orientation would be nice.
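On the question of where scraped elements end up: a tool like quickscrape typically writes the captured metadata to disk alongside any downloaded files. The sketch below shows one plausible layout (one directory per scraped URL, holding a `results.json`); the function, directory scheme, and file names are assumptions for illustration, not quickscrape's documented behaviour.

```python
import json
import os
import tempfile

def save_results(outdir, url, results):
    """Write scraped metadata as JSON in a per-URL subdirectory.

    Hypothetical layout: <outdir>/<url-derived-name>/results.json,
    with downloaded PDFs/XML placed next to it by the caller.
    """
    safe_name = url.replace("://", "_").replace("/", "_")
    target = os.path.join(outdir, safe_name)
    os.makedirs(target, exist_ok=True)
    path = os.path.join(target, "results.json")
    with open(path, "w") as fh:
        json.dump({"url": url, "elements": results}, fh, indent=2)
    return path

# Example run into a throwaway directory:
outdir = tempfile.mkdtemp()
path = save_results(outdir, "http://example.org/article/1",
                    {"title": "Content mining for everyone"})
print(path)
```

Non-download elements (title, authors, license, ...) would live only in that JSON; download elements additionally trigger a file fetch into the same directory.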
Great - discussions like this are a significant way of taking things forward.
I have copied in Chris Kittel, who oversees the documentation.
No, we use Tesseract :-). We are about to measure the OCR error rate.
So it depends what you want to get from it.
We have to put that right, @ck
Yes. As far as possible we try to have Test-Driven Development.
@klartext Just now we released a new tutorial on creating scraper definitions; I hope this covers some of your questions. Feedback is highly appreciated!
Chris Kittel has just posted a draft tutorial on scrapers: I think it would be very useful for him if you made comments. P.
Hi, this tutorial is very good. It's a good starting point, addressing many questions. I will comment further when I have time to read the rest of the document.
Thank you, we're looking forward to your suggestions.
It would be nice to have real-world example scraperJSON files
together with the directory/file collection that is created by
running a scraper with that scraperJSON file.
That would help when implementing a scraper that follows the scraperJSON scheme/policy.
A *.zip or *.tgz file for the results (or the JSON file plus results) would make sense as examples, IMHO.
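Such an example bundle could be produced mechanically after a scraper run. The sketch below packs a definition plus its result directory into one `.tgz`; all file names (`example_scraper.json`, `results/`, `example_run.tgz`) are invented for illustration, not names any ContentMine tool actually uses.

```python
import json
import os
import tarfile
import tempfile

# Build a fake scraper run in a throwaway directory:
# a scraperJSON definition next to the results it produced.
workdir = tempfile.mkdtemp()
os.makedirs(os.path.join(workdir, "results"))
with open(os.path.join(workdir, "example_scraper.json"), "w") as fh:
    json.dump({"url": "example\\.org", "elements": {}}, fh)
with open(os.path.join(workdir, "results", "results.json"), "w") as fh:
    json.dump({"title": ["Example article"]}, fh)

# Pack definition + results into one archive, as suggested above.
archive = os.path.join(workdir, "example_run.tgz")
with tarfile.open(archive, "w:gz") as tgz:
    tgz.add(os.path.join(workdir, "example_scraper.json"),
            arcname="example_scraper.json")
    tgz.add(os.path.join(workdir, "results"), arcname="results")

with tarfile.open(archive, "r:gz") as tgz:
    print(sorted(tgz.getnames()))
```

A reader can then unpack the archive and compare each element in the definition against the concrete files and metadata it produced.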