SPEC for new ingestion process

Asaf Bartov edited this page May 19, 2024 · 6 revisions

Overview

  • The overall goal: given a DOCX file containing one or more texts (usually drawn from a single volume, i.e. a book, and often not the entire volume), we want to:
    • convert the text to markdown
    • have a human editor mark the boundaries between the different texts, if ingesting several texts at once (e.g. a group of poems or articles)
    • have an editor fix up the imperfect conversion to markdown
    • ingest the text(s) into the system (creating works, expressions, manifestations)
    • add them to the appropriate collection (creating a new volume if necessary), and
    • associate the text(s) with the appropriate InvolvedAuthorities as author(s) and translator(s).

Design principles

  • progress must be saved at each step, so that the process can be resumed if interrupted.
    • (meaning an interim entity is needed. Currently it's the historically-named HtmlFile entity, but this is a misnomer and should be renamed to something like Ingestible or IngestionSession)
  • warn user about existing combinations of title and authorship in catalog before completing ingestion
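
The two principles above can be illustrated with a minimal sketch. It assumes the interim entity is called `Ingestible` (one of the names suggested above) and is persisted as JSON; the step names, field names, and the `duplicate_warnings` helper are hypothetical and not taken from the existing system.

```python
import json
from dataclasses import dataclass, field, asdict

# Hypothetical step names; the spec does not define a fixed list.
STEPS = ["uploaded", "converted", "split", "metadata_filled", "ingested"]


@dataclass
class Ingestible:
    """Interim record that holds all progress of one ingestion."""
    docx_path: str
    step: str = "uploaded"
    markdown: str = ""
    metadata: dict = field(default_factory=dict)

    def advance(self, new_step: str) -> None:
        """Move to the same or a later step; refuse to move backwards."""
        if STEPS.index(new_step) < STEPS.index(self.step):
            raise ValueError(f"cannot go back from {self.step} to {new_step}")
        self.step = new_step

    def to_json(self) -> str:
        """Serialize the whole session so it can be saved after each step."""
        return json.dumps(asdict(self), ensure_ascii=False)

    @classmethod
    def from_json(cls, payload: str) -> "Ingestible":
        """Restore a saved session so work can resume where it stopped."""
        return cls(**json.loads(payload))


def duplicate_warnings(candidates, catalog):
    """Return the (title, author) candidates that already exist in the catalog.

    Matching here is only case- and whitespace-insensitive; real heuristics
    would likely be fuzzier.
    """
    def normalize(pair):
        title, author = pair
        return (title.strip().lower(), author.strip().lower())

    existing = {normalize(pair) for pair in catalog}
    return [pair for pair in candidates if normalize(pair) in existing]
```

Saving `to_json()` after every step and reloading with `from_json()` is one way to make an interrupted session resumable; the duplicate check would run before the ingestion is completed.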

Flow

Ingestion prep

  • Optionally, ingestion can begin with a click on a task in the (separate) PBY tasks system, which opens the ingestion form with the task's DOCX already loaded and parsed.
  • convert DOCX to MultiMarkdown (using pandoc), save in Ingestible
  • To avoid duplication, ask the editor to identify the author (authority). If the author is not in the system, trigger a parallel process to create the authority; the flow continues, but the authority needs to be completed before ingestion. The editor then chooses where the text(s) go:
    • select the correct existing collection/volume to add them to,
    • or create a new collection/volume if one does not exist in the system,
    • or upload without creating a new volume (rare, confirmation required)
  • Note: for multi-author collection/volume, identify via volume name with auto-suggest, or create new collection/volume
  • For a new collection/volume, if system heuristics identify a likely duplicate of an existing one, flag it and ask for confirmation
  • Add a TOC for the included texts (required step):
    • if a TOC appears in the text, the volume can be uploaded in sections (assuming separate texts), but the upload has to start with the first section that includes the TOC. In this case, at the later ingestion step, placeholders will be created for the missing texts.
    • if no TOC is included, wait until all sections of the volume are typed and combined into one file, and then create the TOC manually by scanning the entire volume. In this case there is no need for placeholders.
  • In some cases, allow the user to override the TOC requirement and upload a section with just one placeholder for the additional sections
  • identify the different texts in the buffer, using a magic delimiter
  • simple case: all texts are same genre, author, translator, edition details, and collection -- fill out these data once, and they apply to all. (This is supported today)
  • complex case: texts are different in any of the above ways -- fill out these data for each text (using a basic GUI within the markdown preview display) and store this complexity in Ingestible (in JSON?). (This is not supported today)
  • every text must be associated with precisely one collection, which may be an existing or a new one. Examples:
    • a text is part of a volume that is already in the system
    • a text is part of a volume that is not in the system (but is known bibliographically, as a Publication entity); the volume collection is created when the ingestion is launched; then the text is associated with it.
    • a text is not part of a volume at all (e.g. an uncollected article); it is associated with a new or existing collection called "Uncollected Works of author X" (or similar), itself under author X's root collection.
    • some of the texts are grouped internally into a sub-collection of the volume (e.g. a section in an essay collection, a cycle of sonnets); the sub-collections are created when the ingestion is launched, and the texts are associated with them. Those sub-collections themselves must belong to a volume collection.
  • if a text in the ingestion requires associating with a person/org not yet in the system, it should be possible to mark it as needing creation and to proceed with the rest of the ingestion prep. (Creating new authorities is something only specific volunteers do, and there are more volunteers working on ingestion.)
  • authorized editors can pick up an ingestion that's pending the creation of missing/unidentified authorities, create those authorities, associate the texts with the new authorities, and then launch the ingestion.
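
Several of the prep steps above can be sketched in a few small functions. This is an illustration under stated assumptions, not the system's implementation: the `&&&` delimiter, the convention that the rest of a delimiter line is the text's title, and all function and field names are hypothetical. The pandoc invocation uses pandoc's `docx` input format and its `markdown_mmd` (MultiMarkdown) output format.

```python
DELIMITER = "&&&"  # hypothetical marker; the spec only says "a magic delimiter"


def pandoc_command(docx_path, out_path):
    """Build the pandoc invocation for DOCX -> MultiMarkdown (not executed here)."""
    return ["pandoc", "--from", "docx", "--to", "markdown_mmd",
            docx_path, "--output", out_path]


def split_texts(buffer, delimiter=DELIMITER):
    """Split the markdown buffer into separate texts.

    A line starting with the delimiter opens a new text; the rest of that
    line is used as the title (an assumption of this sketch). Lines before
    the first delimiter are ignored.
    """
    texts = []
    current = None
    for line in buffer.splitlines():
        if line.startswith(delimiter):
            current = {"title": line[len(delimiter):].strip(), "lines": []}
            texts.append(current)
        elif current is not None:
            current["lines"].append(line)
    return [{"title": t["title"], "markdown": "\n".join(t["lines"]).strip()}
            for t in texts]


def resolve_metadata(defaults, overrides):
    """Simple case: overrides is empty and the defaults apply as-is.
    Complex case: per-text overrides win over the shared defaults."""
    return {**defaults, **overrides}


def pending_authorities(texts_meta, known):
    """List author/translator names that are not yet in the system."""
    missing = []
    for meta in texts_meta:
        for role in ("author", "translator"):
            name = meta.get(role)
            if name and name not in known and name not in missing:
                missing.append(name)
    return missing
```

The list returned by `build_pandoc`-style helpers such as `pandoc_command` could be passed to `subprocess.run` on a machine where pandoc is installed. Keeping shared defaults separate from per-text overrides means the simple case stores nothing extra, while the complex case only records what differs, which would serialize naturally to JSON in the Ingestible.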

Ingestion launch

  • when the editor has perfected the ingestible (fixed up the markdown, filled out the metadata, modeled the collections correctly) and all authorities marked as pending (with a plain string placeholder) have been created and associated, they launch the ingestion
  • ONLY THEN are the works, expressions, manifestations, and collections created in the system
  • the texts are associated with the appropriate collections
  • the texts are associated with the appropriate InvolvedAuthorities as author(s) and translator(s) (and others like illustrators, editors, etc - TBD)
  • placeholders are created and confirmed by the user
  • the ingestible is marked as complete, and the IDs of the created entities are stored in it (for review and ease of later undoing)
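
The launch rules above (nothing is created until launch, launch is blocked by pending authorities, and created IDs are recorded for later undoing) can be sketched as follows. The `IngestionResult` class, the `launch` and `undo_order` functions, and the ID counter are hypothetical stand-ins; real entity creation would happen in the application's database layer.

```python
from dataclasses import dataclass, field
from itertools import count

KINDS = ("works", "expressions", "manifestations")


@dataclass
class IngestionResult:
    """What the ingestible records once the launch has finished."""
    complete: bool = False
    created_ids: dict = field(default_factory=lambda: {k: [] for k in KINDS})


def launch(texts, pending, next_id=None):
    """Create one work, expression, and manifestation per text.

    Nothing is created if any authority is still pending. IDs come from a
    counter here; in a real system they would come from the database.
    """
    if pending:
        raise ValueError(f"authorities still pending: {sorted(pending)}")
    if next_id is None:
        next_id = count(1)
    result = IngestionResult()
    for _ in texts:
        for kind in KINDS:
            result.created_ids[kind].append(next(next_id))
    result.complete = True
    return result


def undo_order(result):
    """Return (kind, id) pairs in a safe deletion order:
    manifestations first, then expressions, then works, newest first."""
    order = []
    for kind in reversed(KINDS):
        order.extend((kind, i) for i in reversed(result.created_ids[kind]))
    return order
```

Storing the IDs grouped by entity kind is one possible layout; it makes both the review and a later undo a matter of walking the recorded lists.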