playing with a Selenium-based crawler
danja committed Sep 16, 2017
1 parent 09e4dff commit 16ccc0a
Showing 6 changed files with 859 additions and 2 deletions.
6 changes: 5 additions & 1 deletion README.md
@@ -58,6 +58,11 @@ Some background over here : https://dannyayers.wordpress.com/2014/12/30/browser-

Apache 2 license.

## 'Static' Rendering
There are copies of the scripts used to render pages (index-static.html, core-static.js etc.) with all links to editing facilities removed, to provide a static archive of the content. Making the archive this way is not straightforward, because the JavaScript has to run in a browser before the content becomes visible. So I'm working on a [Selenium](http://www.seleniumhq.org/)-based crawler to sort this out (and dump the content as files).
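A minimal sketch of what such a crawler could look like, assuming Firefox with geckodriver on the PATH and the `fuseki.local` URLs seen in `utils/crawler_log.txt`; the function and file names here are hypothetical, not the actual script:

```python
# Hypothetical sketch of a Selenium-based static-archive crawler.
# It loads each rendered page, waits for the JavaScript to populate
# the DOM, and writes the resulting HTML to a local file.
from urllib.parse import urlsplit, unquote

BASE = "http://fuseki.local/foowiki/"  # base URL assumed from crawler_log.txt

def filename_for(url):
    """Derive a safe local filename from a page URL (hypothetical scheme)."""
    parts = urlsplit(url)
    if parts.query:
        # page-static.html?uri=http://hyperdata.it/wiki/Some%20Page
        name = unquote(parts.query.split("uri=")[-1])
    else:
        name = parts.path.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    name = name.replace("http://", "").replace("/", "_").replace(" ", "_")
    return name + ".html"

def scrape(urls, out_dir="static"):
    # selenium is imported here so the helper above works without it installed
    import os, time
    from selenium import webdriver
    os.makedirs(out_dir, exist_ok=True)
    driver = webdriver.Firefox()  # needs geckodriver on the PATH
    try:
        for url in urls:
            driver.get(url)
            time.sleep(2)  # crude wait for the JS rendering to finish
            with open(os.path.join(out_dir, filename_for(url)), "w") as f:
                f.write(driver.page_source)
            print("Done scraping " + url)
    finally:
        driver.quit()
```

The fixed sleep is the simplest possible wait; Selenium's explicit waits would be more robust for pages whose rendering time varies.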

I've nearly implemented this, but it's since occurred to me that it would be easier to pull the content directly from the SPARQL store with a script, ignoring the browser rendering altogether.
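The direct approach could be sketched like this, using only the standard library against a Fuseki-style SPARQL endpoint; the endpoint URL and the `dc:title`/`dc:description` property names are assumptions, not the actual FooWiki schema:

```python
# Hypothetical sketch: pull page content straight from the SPARQL store,
# skipping the browser rendering altogether.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ENDPOINT = "http://fuseki.local/foowiki/query"  # assumed Fuseki query endpoint

QUERY = """
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?s ?title ?content
WHERE { ?s dc:title ?title ; dc:description ?content }
"""  # property names are guesses at the wiki's schema

def run_select(endpoint, query):
    """Run a SELECT over HTTP GET, asking for SPARQL JSON results."""
    url = endpoint + "?" + urlencode(
        {"query": query, "format": "application/sparql-results+json"})
    with urlopen(url) as resp:
        return json.load(resp)

def bindings_to_rows(results):
    """Flatten SPARQL JSON results into plain dicts of variable -> value."""
    return [{var: b[var]["value"] for var in b}
            for b in results["results"]["bindings"]]
```

Each row could then be written out through the same page template the browser uses, giving the static archive without Selenium.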

## Date Issue
At some point I changed the date handling from a simple dc:date for each post to a dc:created and dc:modified pair.

@@ -74,6 +79,5 @@ WHERE {
?s dc:date ?date
}
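One way the old triples could be migrated is a single SPARQL UPDATE; this is a sketch only, and the mapping (copying dc:date into both dc:created and dc:modified) is an assumption, since the README does not spell out the rule:

```python
# Hypothetical migration: copy each dc:date into dc:created and
# dc:modified, then delete the old triple. The update string could be
# POSTed to the store's update endpoint.
PREFIXES = "PREFIX dc: <http://purl.org/dc/elements/1.1/>\n"

def migration_update():
    """Build a SPARQL UPDATE migrating dc:date to the newer pair."""
    return PREFIXES + (
        "DELETE { ?s dc:date ?date }\n"
        "INSERT { ?s dc:created ?date . ?s dc:modified ?date }\n"
        "WHERE  { ?s dc:date ?date }"
    )
```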


## See Also
I plan to use the same data model in [Seki](https://github.com/danja/seki) (middleware/a front-end for connecting to an independent SPARQL server using node.js) and [Thiki](https://github.com/danja/thiki) (Personal Wiki for Android Devices).
2 changes: 1 addition & 1 deletion index.html
@@ -139,7 +139,7 @@ <h3>Export Turtle</h3>

<!-- input id="fileupload" type="file" name="files[]" data-url="/foowiki/upload" multiple -->

<div id="footer"><em>thoughtcatchers.org</em>
<div id="footer">hyperdata.it - Danny Ayers 2017 <em><a href="index-static.html">static</a></em>
</div>

</div>
7 changes: 7 additions & 0 deletions utils/crawler_log.txt
@@ -0,0 +1,7 @@
Done scraping http://fuseki.local/foowiki/index-static.html
Done scraping http://fuseki.local/foowiki/page-static.html?uri=http://hyperdata.it/wiki/FooWiki%20Bugs
Done scraping http://fuseki.local/foowiki/page-static.html?uri=http://hyperdata.it/wiki/Image%20Testing
Done scraping http://fuseki.local/foowiki/page-static.html?uri=http://hyperdata.it/wiki/FooWiki%20Links
Done scraping http://fuseki.local/foowiki/page-static.html?uri=http://hyperdata.it/wiki/Basic%20Features
Done scraping http://fuseki.local/foowiki/page-static.html?uri=http://hyperdata.it/wiki/FooWiki%20To%20Do
Done scraping http://fuseki.local/foowiki/page-static.html?uri=http://hyperdata.it/wiki/FooWiki%20Manual
Binary file added utils/geckodriver-v0.19.0-linux64.tar.gz
Binary file not shown.