Importing Works from the Internet Archive

FromThePage supports transcribing books hosted on the Internet Archive. This is a great way to explore documents that have already been digitized, and it's actually easier to set up than uploading scans directly to FromThePage.

Find the work to import

1. Log in as a user who is authorized to own works.
2. Click the Dashboard link (located next to the login link).
3. On the left side of the screen, you'll see an area called "Owner Actions".

Click Explore OAI Repositories

Click "Show All Sets" next to the Archive.org link

Wait (possibly several minutes) for FromThePage to query Archive.org for all of its OAI sets. The list is very long, and I think I'll need to explore ways to streamline this process for general use. For your purposes, you only have to go through this once; after that there's a shortcut.

Search the page for sandiegonaturalhistory. Click "Save for future use" next to the spec.
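To give a sense of why this step is slow, here is a minimal Python sketch of an OAI-PMH ListSets exchange. It is not FromThePage's own code. The endpoint URL, the function names, and the sample response (including the set names) are assumptions for illustration; only the `verb`, `resumptionToken`, `setSpec` and `setName` names come from the OAI-PMH protocol itself. A large repository returns its sets one page at a time, so a full listing means many round trips.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumed endpoint; use the repository URL configured in your own instance.
OAI_ENDPOINT = "https://archive.org/services/oai.php"
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Illustrative response page; the set names are made up.
SAMPLE_LIST_SETS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set>
      <setSpec>collection:sandiegonaturalhistory</setSpec>
      <setName>Illustrative set name</setName>
    </set>
    <set>
      <setSpec>collection:example</setSpec>
      <setName>Another illustrative set</setName>
    </set>
    <resumptionToken>page2</resumptionToken>
  </ListSets>
</OAI-PMH>"""


def list_sets_url(resumption_token=None):
    """Build a ListSets request URL, optionally continuing a paged listing."""
    params = {"verb": "ListSets"}
    if resumption_token:
        params["resumptionToken"] = resumption_token
    return f"{OAI_ENDPOINT}?{urlencode(params)}"


def parse_sets(xml_text):
    """Return ([(setSpec, setName), ...], next_token_or_None) for one page."""
    root = ET.fromstring(xml_text)
    sets = [
        (
            s.findtext("oai:setSpec", default="", namespaces=OAI_NS),
            s.findtext("oai:setName", default="", namespaces=OAI_NS),
        )
        for s in root.findall(".//oai:set", OAI_NS)
    ]
    token = root.findtext(
        ".//oai:resumptionToken", default="", namespaces=OAI_NS
    ).strip()
    return sets, (token or None)


def find_specs(sets, needle):
    """Return the set specs that contain the search text."""
    return [spec for spec, _name in sets if needle in spec]
```

Searching the parsed list for `sandiegonaturalhistory` is the scripted equivalent of searching the page in your browser. A non-empty token means another request is needed before the list is complete.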

This should redirect you to the dashboard again. There should now be a link in the owner's section saying "List works to import from collection:sandiegonaturalhistory". Whenever you view the dashboard, you should see this link, so you won't have to go through the repository exploration step again.

Click the "List works to import" link. This will query Archive.org for the works it has in that OAI set, which is currently just the Klauber field notes.
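The query behind this link can be sketched as an OAI-PMH ListRecords request restricted to the saved set. As before, this is a hedged illustration rather than the importer's actual code: the endpoint, the function names, and the sample record (identifier and title) are assumptions, while `ListRecords`, `metadataPrefix=oai_dc` and `set` are standard OAI-PMH parameters.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumed endpoint; use the repository URL configured in your own instance.
OAI_ENDPOINT = "https://archive.org/services/oai.php"
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Illustrative response; the identifier and title are made up.
SAMPLE_LIST_RECORDS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:archive.org:example-fieldnotes-1</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example field notes, volume 1</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(set_spec):
    """Build a ListRecords request URL limited to one OAI set."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc", "set": set_spec}
    return f"{OAI_ENDPOINT}?{urlencode(params)}"


def parse_records(xml_text):
    """Return [(identifier, title), ...] for the records in one response."""
    root = ET.fromstring(xml_text)
    works = []
    for record in root.findall(".//oai:record", NS):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        title = record.findtext(".//dc:title", default="", namespaces=NS)
        works.append((identifier, title))
    return works
```

Each (identifier, title) pair corresponds to one row with an Import button in the list of works.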

Click the Import button beside one of the field notes. Behind the scenes, FromThePage switches from OAI-PMH to the Archive.org API to fetch the locations of the XML documents for that book. It imports all the relevant Archive.org information about the book, as well as information about each scanned leaf, into FromThePage. This process takes a couple of minutes, and is a good target for future usability work. The import process adds this IA book to the user's staging area (accessible via the dashboard), and redirects straight to the Manage Import screen.
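As a rough sketch of what "fetching locations for the XML documents" might look like, the Python below picks XML files out of an Archive.org item's file list and builds download URLs for them. Everything here is an assumption rather than a description of the importer: the use of the Archive.org metadata endpoint, the `_scandata.xml` and `_djvu.xml` file suffixes, the `examplebook` identifier, and the function names are all illustrative.

```python
# Assumption: file names ending in these suffixes hold the per-leaf scan data
# and the OCR text. The importer may rely on different documents.
XML_SUFFIXES = ("_scandata.xml", "_djvu.xml")

# Illustrative, heavily trimmed item metadata for a hypothetical identifier.
SAMPLE_METADATA = {
    "files": [
        {"name": "examplebook_scandata.xml"},
        {"name": "examplebook_djvu.xml"},
        {"name": "examplebook.pdf"},
    ]
}


def metadata_url(identifier):
    """URL of the item's JSON metadata (assumed endpoint)."""
    return f"https://archive.org/metadata/{identifier}"


def xml_document_urls(identifier, metadata):
    """Map each known suffix to the download URL of the matching file."""
    urls = {}
    for entry in metadata.get("files", []):
        name = entry.get("name", "")
        for suffix in XML_SUFFIXES:
            if name.endswith(suffix):
                urls[suffix] = f"https://archive.org/download/{identifier}/{name}"
    return urls
```

Downloading and parsing the documents themselves is where most of the couple of minutes would go.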

The Manage Import screen shows all the pages imported from Archive.org (alongside some debugging info on the right--ignore that), and provides the following three features:

Purge Delete Scans: Archive.org classifies some scanned leaves as type="Delete". These are apparently things like color calibration cards, and Archive.org never displays them. They should be purged, so press this button.
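The purge amounts to filtering leaves by their type. The Python sketch below assumes a simplified scan-data layout in which each `page` element carries a `leafNum` attribute and a `pageType` child; that layout, the sample data, and the function name are assumptions for illustration, not the importer's actual data model.

```python
import xml.etree.ElementTree as ET

# Illustrative, simplified scan data for three leaves.
SAMPLE_SCANDATA = """<book>
  <pageData>
    <page leafNum="0"><pageType>Delete</pageType></page>
    <page leafNum="1"><pageType>Title</pageType></page>
    <page leafNum="2"><pageType>Normal</pageType></page>
  </pageData>
</book>"""


def split_delete_leaves(scandata_xml):
    """Return (keep, purge): leaf numbers to keep and leaf numbers to purge."""
    root = ET.fromstring(scandata_xml)
    keep, purge = [], []
    for page in root.iter("page"):
        leaf = int(page.get("leafNum"))
        page_type = (page.findtext("pageType") or "").strip()
        # Leaves marked "Delete" are calibration cards and similar non-content.
        if page_type == "Delete":
            purge.append(leaf)
        else:
            keep.append(leaf)
    return keep, purge
```

With the sample data, leaf 0 would be purged and leaves 1 and 2 kept.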

Retitle from OCR: this is unique to 20th-century daybooks like Klauber's. For these materials, the OCR has done a pretty good job of parsing the date that's printed at the top of each page. I've written code to replace the numeric page titles (which are really leaf numbers) with these parsed OCR entries. Press this button and wait a few minutes for the parsing to happen. You should then see the pages retitled from the OCR text. You'll have the opportunity to change the page titles later, so don't worry if some of them are gibberish.
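The idea behind the retitling can be sketched in Python as a date search over the first few OCR lines of each page. This is not the importer's code: the regular expression, the "Month day, year" date format, the function name, and the sample text are all assumptions. Note one deliberate difference: where the importer may produce gibberish titles from bad OCR, this sketch falls back to the original leaf title when no valid date is found.

```python
import re
from datetime import date

MONTHS = {
    name: number
    for number, name in enumerate(
        [
            "january", "february", "march", "april", "may", "june", "july",
            "august", "september", "october", "november", "december",
        ],
        start=1,
    )
}

# Assumed printed format: a full month name, a day, and a four-digit year.
DATE_RE = re.compile(r"\b([A-Za-z]+)\s+(\d{1,2}),?\s+(\d{4})\b")


def title_from_ocr(ocr_text, fallback, max_lines=3):
    """Return an ISO date found near the top of the page, else the fallback."""
    for line in ocr_text.splitlines()[:max_lines]:
        for match in DATE_RE.finditer(line):
            month = MONTHS.get(match.group(1).lower())
            if month is None:
                continue  # a word followed by numbers, but not a month name
            try:
                parsed = date(int(match.group(3)), month, int(match.group(2)))
            except ValueError:
                continue  # impossible date, probably an OCR error
            return parsed.isoformat()
    return fallback
```

For example, `title_from_ocr("MONDAY, JANUARY 5, 1925\nCollected two specimens.", "7")` yields `1925-01-05`, while garbled header text leaves the title as `7`.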

Convert to FromThePage: This converts an IA-imported book and its leaves into a FromThePage work with corresponding pages. This is the final piece of the IA book importer. It also takes a few minutes to run, and is a good candidate for usability work. Press this button and wait for a bit.

Once the converter is finished, you can access the work from the dashboard. The converter will redirect you directly into the standard FromThePage work settings editing screen. To edit data about the work, hover over a field and click it.

The other tabs, which you will not have seen as a FromThePage end user, are the Access and Pages tabs. I suspect that the most useful thing an IA work importer will do at this step is go to the Pages tab and edit the titles for pages with bad OCR.
