playing with a Selenium-based crawler
danja committed Sep 16, 2017
1 parent 09e4dff commit 16ccc0a
Showing 6 changed files with 859 additions and 2 deletions.
6 changes: 5 additions & 1 deletion README.md
@@ -58,6 +58,11 @@ Some background over here : https://dannyayers.wordpress.com/2014/12/30/browser-

Apache 2 license.

## 'Static' Rendering
There are copies of the scripts used to render pages (index-static.html, core-static.js etc.) with all links to editing facilities removed, to provide a static archive of the content. Making the archive this way is not straightforward, because the JavaScript has to run in a browser before the content becomes visible. So I'm working on a [Selenium](http://www.seleniumhq.org/)-based crawler to sort this out (and dump the content as files).
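A minimal sketch of what such a crawler could look like, assuming Firefox with geckodriver on the PATH and the `fuseki.local` URLs seen in `utils/crawler_log.txt`; the function and file names here are hypothetical, not the actual script:

```python
# Hypothetical sketch of a Selenium-based static-archive crawler.
# It loads each rendered page, waits for the JavaScript to populate
# the DOM, and writes the resulting HTML to a local file.
from urllib.parse import urlsplit, unquote

BASE = "http://fuseki.local/foowiki/"  # base URL assumed from crawler_log.txt

def filename_for(url):
    """Derive a safe local filename from a page URL (hypothetical scheme)."""
    parts = urlsplit(url)
    if parts.query:
        # page-static.html?uri=http://hyperdata.it/wiki/Some%20Page
        name = unquote(parts.query.split("uri=")[-1])
    else:
        name = parts.path.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    name = name.replace("http://", "").replace("/", "_").replace(" ", "_")
    return name + ".html"

def scrape(urls, out_dir="static"):
    # selenium is imported here so the helper above works without it installed
    import os, time
    from selenium import webdriver
    os.makedirs(out_dir, exist_ok=True)
    driver = webdriver.Firefox()  # needs geckodriver on the PATH
    try:
        for url in urls:
            driver.get(url)
            time.sleep(2)  # crude wait for the JS rendering to finish
            with open(os.path.join(out_dir, filename_for(url)), "w") as f:
                f.write(driver.page_source)
            print("Done scraping " + url)
    finally:
        driver.quit()
```

The fixed sleep is the simplest possible wait; Selenium's explicit waits would be more robust for pages whose rendering time varies.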

I've nearly implemented this, but it's since occurred to me that it would be easier to pull the content directly from the SPARQL store with a script, ignoring the browser rendering altogether.
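The direct approach could be sketched like this, using only the standard library against a Fuseki-style SPARQL endpoint; the endpoint URL and the `dc:title`/`dc:description` property names are assumptions, not the actual FooWiki schema:

```python
# Hypothetical sketch: pull page content straight from the SPARQL store,
# skipping the browser rendering altogether.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ENDPOINT = "http://fuseki.local/foowiki/query"  # assumed Fuseki query endpoint

QUERY = """
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?s ?title ?content
WHERE { ?s dc:title ?title ; dc:description ?content }
"""  # property names are guesses at the wiki's schema

def run_select(endpoint, query):
    """Run a SELECT over HTTP GET, asking for SPARQL JSON results."""
    url = endpoint + "?" + urlencode(
        {"query": query, "format": "application/sparql-results+json"})
    with urlopen(url) as resp:
        return json.load(resp)

def bindings_to_rows(results):
    """Flatten SPARQL JSON results into plain dicts of variable -> value."""
    return [{var: b[var]["value"] for var in b}
            for b in results["results"]["bindings"]]
```

Each row could then be written out through the same page template the browser uses, giving the static archive without Selenium.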

## Date Issue
At some point I changed the date handling from a simple dc:date for each post to a dc:created and dc:modified pair.

@@ -74,6 +79,5 @@ WHERE {
?s dc:date ?date
}
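One way the old triples could be migrated is a single SPARQL UPDATE; this is a sketch only, and the mapping (copying dc:date into both dc:created and dc:modified) is an assumption, since the README does not spell out the rule:

```python
# Hypothetical migration: copy each dc:date into dc:created and
# dc:modified, then delete the old triple. The update string could be
# POSTed to the store's update endpoint.
PREFIXES = "PREFIX dc: <http://purl.org/dc/elements/1.1/>\n"

def migration_update():
    """Build a SPARQL UPDATE migrating dc:date to the newer pair."""
    return PREFIXES + (
        "DELETE { ?s dc:date ?date }\n"
        "INSERT { ?s dc:created ?date . ?s dc:modified ?date }\n"
        "WHERE  { ?s dc:date ?date }"
    )
```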


## See Also
I plan to use the same data model in [Seki](https://github.com/danja/seki) (middleware/a front-end for connecting to an independent SPARQL server using node.js) and [Thiki](https://github.com/danja/thiki) (Personal Wiki for Android Devices).
2 changes: 1 addition & 1 deletion index.html
@@ -139,7 +139,7 @@ <h3>Export Turtle</h3>

<!-- input id="fileupload" type="file" name="files[]" data-url="/foowiki/upload" multiple -->

<div id="footer"><em>thoughtcatchers.org</em>
<div id="footer">hyperdata.it - Danny Ayers 2017 <em><a href="index-static.html">static</a></em>
</div>

</div>
7 changes: 7 additions & 0 deletions utils/crawler_log.txt
@@ -0,0 +1,7 @@
Done scraping http://fuseki.local/foowiki/index-static.html
Done scraping http://fuseki.local/foowiki/page-static.html?uri=http://hyperdata.it/wiki/FooWiki%20Bugs
Done scraping http://fuseki.local/foowiki/page-static.html?uri=http://hyperdata.it/wiki/Image%20Testing
Done scraping http://fuseki.local/foowiki/page-static.html?uri=http://hyperdata.it/wiki/FooWiki%20Links
Done scraping http://fuseki.local/foowiki/page-static.html?uri=http://hyperdata.it/wiki/Basic%20Features
Done scraping http://fuseki.local/foowiki/page-static.html?uri=http://hyperdata.it/wiki/FooWiki%20To%20Do
Done scraping http://fuseki.local/foowiki/page-static.html?uri=http://hyperdata.it/wiki/FooWiki%20Manual
Binary file added utils/geckodriver-v0.19.0-linux64.tar.gz
Binary file not shown.