Million Dollar Web Page: Summary of Results

An open dataset studying the Million Dollar Web Page (2005), published in response to the Harvard Library Innovation Lab's work.

The Boing Boing and Harvard links do not seem to point to open data sources. This repository provides an open set of links, extracted automatically with the HTTPreserve suite.

We also provide a CSV recording the responses received from the web, including the earliest and latest Internet Archive links.

The numbers do not seem to align with Harvard's, but the process behind those figures is not all that transparent.

Process followed:

  • Clean the Million Dollar Page HTML to remove anything not in the image map.

  • Extract links using Tikalinkextract

     $ ./tikalinkextract -noprotocol -quote -file million/ > mil.csv
    
  • Run the CSV through HTTPreserve workbench:

     $ ./workbench --csv --list mil.csv > mil_ia.csv
    

The process takes approximately 138 minutes. No screenshots were created, but they are available if wanted.

Links

Unique Links using Tikalinkextract: 3046 (Harvard: 2816)

IA Data

  • Earliest date on which the Internet Archive (IA) first archived one of the linked pages: 31 October 1996

  • Latest date on which the IA first archived one of the linked pages: 16 April 2015

    - http://www.tkqlhce.com/click-1772197-2831550
    - http://www.digden.net/index.php?affid=MDHP2
    
  • Last IA archived page: 05 July 2017
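
Dates like those above can be read from the archive links themselves, because Wayback Machine URLs carry a 14-digit capture timestamp (`/web/YYYYMMDDhhmmss/`). The following is a minimal sketch and not part of HTTPreserve: the function names and sample URLs are hypothetical, and it assumes the archive links have already been read out of the CSV, whose column layout is not described here.

```python
import re
from datetime import datetime

# Wayback Machine capture URLs embed a timestamp: /web/YYYYMMDDhhmmss/<original-url>
_WAYBACK_TS = re.compile(r"/web/(\d{14})")


def capture_time(wayback_url):
    """Return the capture datetime encoded in a Wayback URL, or None if absent."""
    match = _WAYBACK_TS.search(wayback_url)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


def date_range(wayback_urls):
    """Return (earliest, latest) capture times across the URLs that carry one."""
    times = [t for t in map(capture_time, wayback_urls) if t is not None]
    if not times:
        return None, None
    return min(times), max(times)


# Hypothetical sample values, not taken from the dataset.
sample = [
    "http://web.archive.org/web/19961031000000/http://example.com/",
    "http://web.archive.org/web/20150416120000/http://example.org/",
    "",  # a link with no archived copy
]
earliest, latest = date_range(sample)
print(earliest.date(), latest.date())
```

Links with no archived copy are skipped rather than treated as errors, so an empty cell in the CSV does not affect the range.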

Response Codes and Counts

  • 0xx: 569
  • 2xx: 2258
  • 4xx: 187
  • 5xx: 31
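
A tally like the one above can be produced by grouping each response code by its leading digit. This is a minimal sketch under assumptions, not the Workbench implementation: the function names are hypothetical, and it assumes the response codes have already been read from the CSV as integers or numeric strings.

```python
from collections import Counter


def status_class(code):
    """Map a response code to its class label, e.g. 404 -> '4xx', 0 -> '0xx'."""
    return f"{int(code) // 100}xx"


def count_classes(codes):
    """Count how many response codes fall into each class."""
    return Counter(status_class(code) for code in codes)


# Hypothetical sample values, not taken from the dataset.
sample = ["200", "404", "0", "503", "200"]
for label, count in sorted(count_classes(sample).items()):
    print(f"{label}: {count}")
```

Summing the classes gives a quick consistency check against the link count: the four counts listed above total 3045, against 3046 unique links.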

HTTPreserve

The suite of tools can be found here: https://github.com/httpreserve

Try it here: http://httpreserve.info

Most workflows involve just two tools: extracting links with Tikalinkextract, followed by analysis with Workbench.

The code is licensed under GPLv3.

Data License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.