This repository has been archived by the owner on Apr 17, 2023. It is now read-only.

Off by one error between OCR and page image #117

Open
mzarozinski opened this issue May 23, 2016 · 1 comment

Comments

@mzarozinski
Member

See document cu31924020438929

The actual book has, in order: page 32, a full-page image, a blank page, and page 33.
In Proteus, book page number 33 is associated with the text for page 34.

https://archive.org/stream/cu31924020438929#page/n55/mode/2up
http://laguna.cs.umass.edu:2333/view.html?kind=ia-pages&action=view&id=cu31924020438929_55

@mzarozinski
Member Author

This appears to be an issue with the transformation from rawtei to toktei.
In both the DjVu and the rawtei, "page" 56 (really an image offset) contains "o X". The actual page (https://archive.org/stream/cu31924020438929#page/n56/mode/1up) is an illustration, and the next page is blank.

The Phokas program pulled the contents of that "page" into the previous page, so offset 56 was removed. Offset 57 follows it and is (correctly) blank.

The page index was built using toktei (the list of documents for that index is on sydney at /mnt/nfs/work3/michaelz/data/caribbean-via-grep.list). Proteus expects to see "o X" for offset 56 (the illustration), but that page does not exist in the toktei, resulting in the off-by-one error.
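A minimal sketch of how a dropped page produces this shift. It is not Phokas or Proteus code: the page texts other than "o X", the choice of offsets 55 through 58, and the helper names `merge_into_previous` and `positional_index` are all assumptions made for illustration.

```python
# Hypothetical page texts keyed by image offset.
# Only "o X" at offset 56 and the blank offset 57 come from the issue.
raw_pages = {
    55: "previous page text",
    56: "o X",            # the illustration page
    57: "",               # the blank page
    58: "next page text",
}


def merge_into_previous(pages, offset):
    """Fold one page's text into the page before it and drop that page."""
    merged = dict(pages)
    text = merged.pop(offset)
    merged[offset - 1] = (merged[offset - 1] + " " + text).strip()
    return merged


def positional_index(pages, first_offset):
    """Assign offsets by position, assuming no page is missing."""
    ordered = [pages[key] for key in sorted(pages)]
    return {first_offset + i: text for i, text in enumerate(ordered)}


tok_pages = merge_into_previous(raw_pages, 56)

raw_index = positional_index(raw_pages, 55)
tok_index = positional_index(tok_pages, 55)

# Every page after the dropped one now sits one offset too early.
for offset in sorted(raw_index):
    print(offset, repr(raw_index[offset]), repr(tok_index.get(offset)))
```

In the toktei-style index, the text that belongs to offset 58 is found at offset 57, which matches the kind of one-page shift described above.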

Attached are the rawtei and toktei files. Search for "" in the rawtei, and in the toktei to see the issue.

Ultimately the solution is to either fix Phokas or build the index using rawtei files. My experience has been that building from the rawtei files is the best way to proceed.
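Before rebuilding, it may help to find which documents are affected. The sketch below compares page-break counts between a rawtei and a toktei file. It assumes pages are marked with TEI-style `<pb` elements in both files, which has not been verified against the attachments; the function names are hypothetical.

```python
import gzip
import re

# Assumption: each page starts with a TEI-style <pb .../> element.
PB_PATTERN = re.compile(rb"<pb\b")


def count_page_breaks(data: bytes) -> int:
    """Count page-break markers in the raw bytes of a TEI file."""
    return len(PB_PATTERN.findall(data))


def read_gz(path: str) -> bytes:
    """Read a gzip-compressed file into memory."""
    with gzip.open(path, "rb") as handle:
        return handle.read()


def pages_dropped(raw_data: bytes, tok_data: bytes) -> int:
    """Return how many more page breaks the rawtei has than the toktei."""
    return count_page_breaks(raw_data) - count_page_breaks(tok_data)
```

For example, `pages_dropped(read_gz("cu31924020438929.rawtei.gz"), read_gz("cu31924020438929.toktei.gz"))` would return a positive number if the toktei lost pages, provided the marker assumption holds.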

cu31924020438929.rawtei.gz
cu31924020438929.toktei.gz

@mzarozinski mzarozinski removed their assignment Dec 21, 2016