Added transcription helpers for extracting text from a canvas #15

stephenwf · 2024-04-23T20:12:17Z

Transcription helper.

Will find the following transcriptions:

VTT as rendering on canvas
Embedded Annotation page
External Annotation page
ALTO annotations (FUTURE)

Cookbook:

Plaintext rendering on canvas:

"rendering": [
  {
    "id": "https://fixtures.iiif.io/video/indiana/volleyball/volleyball.txt",
    "type": "Text",
    "label": {
      "en": [
        "Transcript"
      ]
    },
    "format": "text/plain"
  }
]

VTT annotation body on AV canvases:

"annotations": [
  {
    "id": "https://iiif.io/api/cookbook/recipe/0219-using-caption-file/canvas/page2",
    "type": "AnnotationPage",
    "items": [
      {
        "id": "https://iiif.io/api/cookbook/recipe/0219-using-caption-file/canvas/page2/a1",
        "type": "Annotation",
        "motivation": "supplementing",
        "body": {
          "id": "https://fixtures.iiif.io/video/indiana/lunchroom_manners/lunchroom_manners.vtt",
          "type": "Text",
          "format": "text/vtt",
          "label": {
            "en": [
              "Captions in WebVTT format"
            ]
          },
          "language": "en"
        },
        "target": "https://iiif.io/api/cookbook/recipe/0219-using-caption-file/canvas"
      }
    ]
  }
]

OCR annotations, each with:

a motivation of supplementing,
the URI of the OCR file in the id property of the Annotation body, and
the target set to the applicable Canvas.

{
  "id": "https://iiif.io/api/cookbook/recipe/0068-newspaper/newspaper_issue_1-anno_p1.json-1",
  "type": "Annotation",
  "motivation": "supplementing",
  "body": {
    "type": "TextualBody",
    "format": "text/plain",
    "language": "de",
    "value": "I. 54. Jahrgang"
  },
  "target": {
    "type": "SpecificResource",
    "source": {
      "id": "https://iiif.io/api/cookbook/recipe/0068-newspaper/canvas/p1",
      "type": "Canvas",
      "partOf": [
        {
          "id": "https://iiif.io/api/cookbook/recipe/0068-newspaper/newspaper_issue_1-manifest.json",
          "type": "Manifest"
        }
      ]
    },
    "selector": {
      "type": "FragmentSelector",
      "conformsTo": "http://www.w3.org/TR/media-frags/",
      "value": "xywh=0,376,399,53"
    }
  }
}
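The xywh value in the FragmentSelector above is what gives a segment its position on the canvas. As a rough sketch (parseXywh and Box are hypothetical names for illustration, not the helper's actual parser), it could be turned into a box like this:

```typescript
// Hypothetical helper for illustration only; not part of the library.
interface Box {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Parses a media-fragment value such as "xywh=0,376,399,53".
// Only integer pixel values are handled; anything else returns null.
function parseXywh(value: string): Box | null {
  const match = /^xywh=(?:pixel:)?(\d+),(\d+),(\d+),(\d+)$/.exec(value.trim());
  if (!match) {
    return null;
  }
  return {
    x: Number(match[1]),
    y: Number(match[2]),
    width: Number(match[3]),
    height: Number(match[4]),
  };
}

console.log(parseXywh("xywh=0,376,399,53"));
```

Percentage-based fragments (xywh=percent:...) are deliberately left out of this sketch.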

OR
Linking Directly to an ALTO File. (FUTURE, NOT IMPLEMENTED)

"rendering": [
  {
    "id": "https://iiif.io/api/cookbook/recipe/0068-newspaper/newspaper_issue_1-alto_p2.xml",
    "type": "Text",
    "format": "application/xml",
    "profile": "http://www.loc.gov/standards/alto/",
    "label": {
      "en": [
        "ALTO XML"
      ]
    }
  }
]

It produces one standard format for temporal transcriptions, plaintext, and positional plaintext, including selectors.

interface Transcription {
  id: string;
  source: any;
  plaintext: string;
  segments: Array<{
    text: string;
    textRaw: string;
    granularity?: 'word' | 'line' | 'paragraph' | 'block' | 'page';
    language?: string;
    selector?: ParsedSelector;
    startRaw?: string;
    endRaw?: string;
  }>;
}

ParsedSelector includes spatial and temporal information, taken either from an annotation or from VTT (the VTT parsing is very simple at the moment, since external libraries for it are heavy). If there is only plaintext, there are no segments.
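To give an idea of what "very simple" VTT parsing can look like, here is a minimal sketch. It is not the code in this PR; Cue, toSeconds and parseVttCues are hypothetical names, and cue settings, styling and NOTE handling are ignored beyond skipping blocks that have no timing line.

```typescript
// Hypothetical types and functions for illustration; not the PR's parser.
interface Cue {
  startRaw: string;
  endRaw: string;
  start: number; // seconds
  end: number; // seconds
  text: string;
}

// Converts "hh:mm:ss.ttt" or "mm:ss.ttt" into seconds.
function toSeconds(timestamp: string): number {
  return timestamp
    .split(":")
    .map(Number)
    .reduce((total, part) => total * 60 + part, 0);
}

function parseVttCues(vtt: string): Cue[] {
  const cues: Cue[] = [];
  // Normalise line endings, then split on blank lines into blocks.
  const blocks = vtt.replace(/\r\n?/g, "\n").split(/\n{2,}/);
  for (const block of blocks) {
    const lines = block.split("\n").filter((line) => line.trim() !== "");
    const timingIndex = lines.findIndex((line) => line.includes("-->"));
    if (timingIndex === -1) {
      continue; // WEBVTT header, NOTE or STYLE block
    }
    const timing = /^(\S+)\s+-->\s+(\S+)/.exec(lines[timingIndex].trim());
    if (!timing) {
      continue;
    }
    cues.push({
      startRaw: timing[1],
      endRaw: timing[2],
      start: toSeconds(timing[1]),
      end: toSeconds(timing[2]),
      // Everything after the timing line is the cue text.
      text: lines.slice(timingIndex + 1).join("\n"),
    });
  }
  return cues;
}
```

Each cue maps naturally onto a segment: the text becomes text/textRaw, and the raw timestamps become startRaw and endRaw.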

A viewer could start with just showing the plaintext, and then implement optional segments later.
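A sketch of that progression, assuming a trimmed-down copy of the Transcription shape above (TranscriptionLike, SegmentLike and toDisplayLines are hypothetical names used only here): fall back to plaintext when there are no segments, and use the segments when they exist.

```typescript
// Trimmed-down, hypothetical copies of the shapes above, so the example
// is self-contained. Only the fields the example reads are included.
interface SegmentLike {
  text: string;
  startRaw?: string;
  endRaw?: string;
}

interface TranscriptionLike {
  id: string;
  plaintext: string;
  segments: SegmentLike[];
}

// Returns one display line per segment, or the plaintext split into
// lines when the transcription has no segments.
function toDisplayLines(transcription: TranscriptionLike): string[] {
  if (transcription.segments.length === 0) {
    return transcription.plaintext.split("\n");
  }
  return transcription.segments.map((segment) =>
    segment.startRaw ? `[${segment.startRaw}] ${segment.text}` : segment.text
  );
}
```

A viewer that only cares about plaintext can ignore the second branch entirely and add it later.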

Some new helpers too:

canvasHasTranscriptionSync() - checks whether a canvas has a transcription, without making any network requests
canvasLoadExternalAnnotationPages() - loads and waits for external Annotation Pages
annotationPageToTranscription() - the actual code for fetching the transcription; it will also fetch all annotation pages. Recommended for use with Vault (to avoid multiple requests).
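For readers who want a feel for what a synchronous check involves, here is a rough sketch. It is not the implementation of canvasHasTranscriptionSync(); hasEmbeddedTranscription, CanvasLike and ResourceLike are hypothetical, and the rules (a text/plain or text/vtt rendering, or an embedded annotation with a supplementing motivation) are an assumption based on the cookbook examples above.

```typescript
// Hypothetical, simplified shapes; real IIIF resources carry more fields.
interface ResourceLike {
  id?: string;
  type?: string;
  format?: string;
  motivation?: string | string[];
  items?: ResourceLike[];
}

interface CanvasLike {
  id: string;
  rendering?: ResourceLike[];
  annotations?: ResourceLike[];
}

const TEXT_FORMATS = new Set(["text/plain", "text/vtt"]);

// Looks only at data already present on the canvas: no fetching.
function hasEmbeddedTranscription(canvas: CanvasLike): boolean {
  const hasTextRendering = (canvas.rendering ?? []).some(
    (rendering) => rendering.format !== undefined && TEXT_FORMATS.has(rendering.format)
  );
  if (hasTextRendering) {
    return true;
  }
  // Embedded annotation pages: any annotation with motivation "supplementing".
  return (canvas.annotations ?? []).some((page) =>
    (page.items ?? []).some((annotation) => {
      const motivations = Array.isArray(annotation.motivation)
        ? annotation.motivation
        : [annotation.motivation];
      return motivations.includes("supplementing");
    })
  );
}
```

An external Annotation Page that is only referenced by id has no items yet, so this sketch reports false for it until the page has been loaded.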

codesandbox-ci · 2024-04-23T20:12:37Z

This pull request is automatically built and testable in CodeSandbox.

To see build info of the built libraries, click here or the icon next to each commit SHA.

stephenwf · 2024-04-23T20:46:38Z

At the moment, we are losing track of the Annotation target when parsing. It will very likely be the Canvas, but it could be any of:

Canvas ID
Media ID (complex timeline)
Choice ID (indicating it works with all choices)

Clients that use the selector for navigation might need to check that it refers to the right target.
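A client-side check along those lines might look like the sketch below. It assumes the annotation target is available in one of the shapes shown in the cookbook examples (a plain URI string, or a SpecificResource with a source); getTargetId, targetsResource and TargetLike are hypothetical names, not helpers from this PR.

```typescript
// Hypothetical, simplified target shape for illustration.
type TargetLike =
  | string
  | {
      id?: string;
      type?: string;
      source?: string | { id: string; type?: string };
    };

// Resolves the ID of the resource an annotation target points at.
function getTargetId(target: TargetLike): string | undefined {
  if (typeof target === "string") {
    // Drop any fragment such as "#t=10,20" or "#xywh=...".
    return target.split("#")[0];
  }
  if (target.type === "SpecificResource" && target.source !== undefined) {
    return typeof target.source === "string" ? target.source : target.source.id;
  }
  return target.id;
}

// True when the target resolves to the expected Canvas, media or Choice ID.
function targetsResource(target: TargetLike, expectedId: string): boolean {
  return getTargetId(target) === expectedId;
}
```

A viewer would call targetsResource() with the ID of whatever it is currently navigating (Canvas, media item or Choice) before applying the selector.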

stephenwf · 2024-04-24T12:26:30Z

Also need to pass in a language, so that the transcription can check for choices structured like this:
https://iiif.io/api/cookbook/recipe/0074-multiple-language-captions/
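A sketch of the kind of selection this implies, assuming the captions are grouped in an annotation body of type Choice whose items each carry a language property (pickBodyForLanguage and BodyLike are hypothetical names, and falling back to the first item is an assumption rather than agreed behaviour):

```typescript
// Hypothetical, simplified body shape for illustration.
interface BodyLike {
  id?: string;
  type: string;
  format?: string;
  language?: string;
  items?: BodyLike[];
}

// Picks the body matching the requested language from a Choice.
// Non-Choice bodies are returned unchanged; when no item matches,
// the first item is used as the default.
function pickBodyForLanguage(body: BodyLike, language: string): BodyLike | undefined {
  if (body.type !== "Choice") {
    return body;
  }
  const items = body.items ?? [];
  return items.find((item) => item.language === language) ?? items[0];
}
```

The same function could be applied to every annotation body before parsing, so single-language captions and multi-language Choices go through one code path.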

stephenwf · 2024-05-09T19:52:48Z

This still needs more testing, so I will leave it open.

stephenwf added 2 commits April 23, 2024 21:05

Added transcription helpers for extracting text from a canvas

7900e41

Simplified shape of transcription

ebe1bde

stephenwf added 3 commits April 23, 2024 21:13

Added new package

149bfd9

Added tests

af0182e

Added fixture for annotations

b72a79c

stephenwf added 21 commits May 27, 2024 22:52

Added nav date helper for navigating by date

04c3726

Fixed

1c4345f

Updated test snapshot

031d072

Image profile bugfix

190d7e0

Added new range helpers for generating more usable trees for displaying

5eaa5ce

Added getAvailableLanguagesFromResource() helper to i18n helpers

4b5e9f9

Added annotation body

13af705

Added language property

2a610a6

1.2.0

1f903a2

Range tests + update parser with bug fix

23feeb8

1.2.1

40176f6

Search 1 helper and tests

11b16d8

Search1 fixes

54524c4

updated deps

e03b142

1.2.2

92a9b94

Fixed search1 autocomplete to allow empty store

e312d8b

1.2.3

92c9615

Added helper to find search service

aa903bb

1.2.4

2a939e2

Fixed types

6f0349f

1.2.5

1796e9b

stephenwf added 8 commits May 27, 2024 22:53

Fixed bug with setting search service

5d409d6

1.2.6

cffa6b1

Fixed search pending state

a8e98bb

1.2.7

779fd33

Search improvements, hit based

c5d6183

1.2.8

90d9409

Added new package

7fbef35

Fixed for transcription helper

91d6fa7

stephenwf marked this pull request as ready for review May 27, 2024 21:54

stephenwf added 2 commits May 27, 2024 22:54

fixed tsup

79a4be9

Merge main

67b47e9

stephenwf merged commit d7dee09 into main May 27, 2024
3 checks passed

stephenwf deleted the feature/transcription-helpers branch May 27, 2024 21:57
