OpenPecha/Pydurma

Project description

Pydurma is a fast and modular collation engine that:

  • uses a very fast collation mechanism (diff-match-patch instead of Needleman-Wunsch)
  • makes it easy to define language-specific tokenization and normalization, necessary for languages like Tibetan
  • keeps track of the character position in the original files (even XML files where markup is removed in normalization)
  • has a configurable engine to select the best reading, based on reading frequency among versions, OCR confidence index, language-specific knowledge, etc.
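Pydurma's actual engine uses the diff-match-patch library; as a rough illustration of the same edit-based alignment idea, here is a sketch using Python's standard-library difflib as a stand-in (this is not Pydurma's API, and difflib is slower than diff-match-patch — it only shows the principle):

```python
# Illustrative only: aligning two tokenized versions of a passage with
# difflib.SequenceMatcher, a stdlib stand-in for diff-match-patch.
from difflib import SequenceMatcher

a = "the quick brown fox jumps over the lazy dog".split()
b = "the quick browne fox jumped over the lazy dog".split()

matcher = SequenceMatcher(a=a, b=b, autojunk=False)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    # tag is 'equal', 'replace', 'delete' or 'insert'
    print(tag, a[i1:i2], b[j1:j2])
```

The opcodes form a tabular alignment of the two witnesses: shared runs become `equal` columns, while divergent readings surface as `replace`, `delete`, or `insert` columns.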

It does not:

  • use any non-tabular (graph) representation (à la CollateX)
  • implement readjustments of alignment based on the distance between tokens (à la CollateX)
  • detect transpositions

It does not yet:

  • use subword tokenizers à la SentencePiece, potentially more robust on dirty (OCR) data than tokenizers based on linguistic features (spaces, punctuation, etc.)
  • allow a configurable token distance function based on language-specific knowledge (graphical distance, phonetic distance) to readjust alignment (à la CollateX's near_match=True option)
  • implement the STAR algorithm to find the best "base" among different editions
  • have a clearly defined export format for the collated text or critical apparatus

We intend Pydurma to be used in large scale projects to automate:

  • merging different OCR outputs of the same images, selecting the best version of each of them
  • creating an edition that averages all other editions automatically
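A minimal sketch of frequency-based variant selection over already-aligned OCR outputs (a hypothetical helper for illustration, not Pydurma's actual selection engine, which can also weigh OCR confidence and language-specific knowledge):

```python
# Hypothetical helper, not Pydurma's API: pick the most frequent reading
# at each aligned position across several OCR outputs of the same page.
from collections import Counter

def select_readings(aligned_versions):
    """Majority vote over aligned token columns."""
    merged = []
    for readings in zip(*aligned_versions):
        best, _count = Counter(readings).most_common(1)[0]
        merged.append(best)
    return merged

# Three OCR outputs of the same line, aligned token-by-token:
ocr_a = ["the", "quick", "br0wn", "fox"]
ocr_b = ["the", "qu1ck", "brown", "fox"]
ocr_c = ["the", "quick", "brown", "f0x"]
print(" ".join(select_readings([ocr_a, ocr_b, ocr_c])))
# prints "the quick brown fox"
```

Each OCR engine makes different mistakes, so a reading shared by a majority of versions is usually the correct one — which is why merging several outputs can beat any single engine.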

The name Pydurma is a combination of:

  • Python
  • Pedurma དཔེ་བསྡུར་མ།, Tibetan for critical or diplomatic edition

Note on possible workflows based on Pydurma

While Pydurma can be used in a classical Lachmannian critical edition process, its innovative design allows it to automate the process of variant selection.

Using this automated selection directly may raise eyebrows, but we want to defend the concept. The type of editions Pydurma can produce:

  • are fully automatic and thus uncritical (uncritical editions?), but can be produced on a large scale (industrial editions?)
  • do not try to reproduce one variant in particular as their base (in that sense are not diplomatic editions, perhaps undiplomatic editions?)
  • are intended to be similar to the concept of a vulgate ("a commonly accepted text or reading", Merriam-Webster)
  • can be regenerated from different inputs (improved OCR, more versions, etc.) giving a different output, and are thus temporary (or at least, not meant to be definitive)
  • optimize measurable linguistic soundness
  • as a result, are a good base for an AI-ready corpus

A workflow using Pydurma can work on a very large scale with minimal human intervention, which we hope can be a game changer for under-resourced literary traditions like the Tibetan tradition.

Pydurma workflow

Pydurma operates in three steps:

  • Preprocessing
  • Collation
  • Variant Selection

Here is that process:

Preprocessing

(diagram)

Collation

(diagram)

Variant Selection

(diagram)
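The preprocessing step is what lets Pydurma map variants back to the original files. A sketch of position-tracking tokenization (a hypothetical helper, not Pydurma's actual API): tokens are normalized for collation, but each one remembers its character offset in the raw text.

```python
# Hypothetical preprocessing sketch: normalize tokens for collation
# while recording each token's start offset in the original file.
import re

def tokenize_with_offsets(text):
    """Return (normalized_token, start_offset) pairs from the raw text."""
    return [(m.group().lower(), m.start())
            for m in re.finditer(r"\S+", text)]

raw = "The Quick  Brown fox"
print(tokenize_with_offsets(raw))
# prints [('the', 0), ('quick', 4), ('brown', 11), ('fox', 17)]
```

Because collation and variant selection operate on the normalized tokens while the offsets survive alongside them, selected readings can be traced back to exact positions in the source files — even when normalization has stripped markup.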

Previous work

text alignment

OCR merging

  • ocromore
  • This was apparently also done by Oliver Hellwig; further research to be done

See also

Need help?

Terms of use

Pydurma is licensed under the Apache license.

Acknowledgements and citation

Pydurma is a creation of:

The intended use of Pydurma at the Buddhist Digital Resource Center was presented at the Digital Humanities Workshop & Symposium organized in January 2023 at the University of Hamburg (see summary of the symposium, slide selection available here).

@software{Roux_Pydurma_2023,
  author = {Roux, Elie and Tenzin Kaldan},
  month = jan,
  title = {{Pydurma}},
  url = {https://github.com/openpecha/pydurma},
  version = {0.1.0},
  year = {2023}
}
