Pydurma is a fast and modular collation engine that:
- uses a very fast collation mechanism (diff-match-patch instead of Needleman-Wunsch)
- makes it easy to define language-specific tokenization and normalization, necessary for languages like Tibetan
- keeps track of the character position in the original files (even XML files where markup is removed in normalization)
- has a configurable engine to select the best reading, based on reading frequency among versions, OCR confidence index, language-specific knowledge, etc.
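The tabular, diff-based approach can be illustrated with a minimal sketch. Pydurma itself uses diff-match-patch; the stdlib `difflib` below only mirrors the idea, and the function name and data layout are invented for illustration:

```python
from difflib import SequenceMatcher

def align(tokens_a, tokens_b):
    """Pairwise tabular alignment of two token sequences.

    Returns rows of (op, slice_of_a, slice_of_b); 'replace' rows hold
    the variant readings, 'equal' rows the shared text."""
    sm = SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)
    return [(op, tokens_a[i1:i2], tokens_b[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes()]

# Two witnesses of the same sentence, tokenized on whitespace.
table = align("the quick brown fox".split(),
              "the quiet brown fox jumps".split())
for row in table:
    print(row)
```

In a real workflow the tokenizer would be language-specific (e.g. syllable-based for Tibetan) rather than a whitespace split.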
It does not:
- use any non-tabular (graph) representation (à la CollateX)
- implement readjustment of alignment based on the distance between tokens (à la CollateX)
- detect transpositions
It does not yet:
- use subword tokenizers à la SentencePiece, potentially more robust on dirty (OCR) data than tokenizers based on linguistic features (spaces, punctuation, etc.)
- allow a configurable token distance function based on language-specific knowledge (graphical distance, phonetic distance) to readjust alignment (à la CollateX's `near_match=True` option)
- implement a STAR algorithm to find the best "base" among the different editions
- have a clearly defined export format for the collated text or critical apparatus
We intend Pydurma to be used in large scale projects to automate:
- merging different OCR outputs of the same images, selecting the best reading at each point
- creating an edition that averages all other editions automatically
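The frequency-based selection behind this merging can be sketched as a toy majority vote over already-aligned columns (the function name and data layout here are made up for illustration; Pydurma's actual selection engine also weighs OCR confidence and language-specific knowledge):

```python
from collections import Counter

def pick_readings(aligned_columns):
    """For each aligned column (one token per witness, None for a gap),
    keep the most frequent reading -- a simple majority vote."""
    result = []
    for column in aligned_columns:
        reading, _count = Counter(t for t in column if t is not None).most_common(1)[0]
        result.append(reading)
    return result

# Three hypothetical OCR outputs of the same line, already aligned column-wise.
columns = [
    ("the", "the", "tne"),
    ("quick", "quick", "quick"),
    ("brown", "brovn", "brown"),
]
print(" ".join(pick_readings(columns)))
```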
The name Pydurma is a combination of:
- Python
- Pedurma དཔེ་བསྡུར་མ།, Tibetan for critical or diplomatic edition
While Pydurma can be used in a classical Lachmannian critical edition process, its innovative design allows it to automate the process of variant selection.
Using this automated selection directly may raise eyebrows, but we want to defend the concept. The type of editions Pydurma can produce:
- are fully automatic and thus uncritical (uncritical editions?), but can be produced on a large scale (industrial editions?)
- do not try to reproduce one variant in particular as their base (in that sense are not diplomatic editions, perhaps undiplomatic editions?)
- are intended to be similar to the concept of a vulgate ("a commonly accepted text or reading", Merriam-Webster)
- can be regenerated from different inputs (improved OCR, more versions, etc.) giving a different output, and are thus temporary (or at least, not meant to be definitive)
- optimize measurable linguistic soundness
- as a result, are a good base for an AI-ready corpus
A workflow using Pydurma can work on a very large scale with minimal human intervention, which we hope can be a game changer for under-resourced literary traditions like the Tibetan tradition.
Pydurma operates in three steps:
- Preprocessing
- Collation
- Variant Selection
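The three steps can be caricatured in a few lines of stdlib Python. This is a toy stand-in, not Pydurma's actual API: `difflib` replaces diff-match-patch, and the normalizer is invented.

```python
import re
from collections import Counter
from difflib import SequenceMatcher

def normalize(text):
    # Hypothetical preprocessing step: lower-case and collapse whitespace.
    return re.sub(r"\s+", " ", text.lower()).strip().split()

def merge(base, *others):
    """Align each witness against a base, then vote on each base token."""
    base_toks = normalize(base)
    votes = [[t] for t in base_toks]
    for other in others:
        toks = normalize(other)
        sm = SequenceMatcher(None, base_toks, toks, autojunk=False)
        for op, i1, i2, j1, j2 in sm.get_opcodes():
            # Collect same-length aligned spans; skip insertions/deletions.
            if op in ("equal", "replace") and (i2 - i1) == (j2 - j1):
                for i, j in zip(range(i1, i2), range(j1, j2)):
                    votes[i].append(toks[j])
    return " ".join(Counter(v).most_common(1)[0][0] for v in votes)

print(merge("The quick brown fox", "the quiok brown fox", "the quick brown f0x"))
```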
Related tools and resources:
- text_alignment_tool
- CollateX
- MEDITE
- Juxta (sources, uses difflib)
- hypercollate
- helayo (paper)
- Versioning Machine
- nmergec ("en-merge-see")
- ocromore
- This was apparently also done by Oliver Hellwig (research to be done)
- Sequence Alignment (Wikipedia)
- MSA: Multiple Sequence Alignment (Wikipedia)
- Smith–Waterman algorithm
- Needleman–Wunsch algorithm
- Spencer 2004: Spencer, M. and Howe, C. J., 2004. Collating Texts Using Progressive Multiple Alignment. Computers and the Humanities, 38/2004, 253–270.
Pydurma is licensed under the Apache license.
Pydurma is a creation of Elie Roux and Tenzin Kaldan.
The intended use of Pydurma at the Buddhist Digital Resource Center was presented at the Digital Humanities Workshop & Symposium organized in January 2023 at the University of Hamburg (see summary of the symposium, slide selection available here).
@software{Roux_Pydurma_2023,
  author = {Roux, Elie and Tenzin Kaldan},
  month = jan,
  title = {{Pydurma}},
  url = {https://github.com/openpecha/pydurma},
  version = {0.1.0},
  year = {2023}
}